Abstract

We study a statistical model for the tensor principal component analysis problem 
introduced by Montanari and Richard: Given an order-3 tensor $T$ of the form
$T = \tau \cdot v_0^{\otimes 3} + A$, where $\tau \geq 0$ is a signal-to-noise ratio, $v_0$ is a
unit vector, and $A$ is a random noise tensor, the goal is to recover the planted vector
$v_0$. For the case that $A$ has iid standard Gaussian entries, we give an efficient
algorithm to recover $v_0$ whenever $\tau \geq \omega(n^{3/4} \log(n)^{1/4})$,
and certify that the recovered vector is close to a maximum likelihood estimator, all
with high probability over the random choice of $A$. The previous best algorithms with
provable guarantees required $\tau \geq \Omega(n)$.

In the regime $\tau \leq o(n)$, natural tensor-unfolding-based spectral relaxations for the
underlying optimization problem break down (in the sense that their integrality gap is 
large). To go beyond this barrier, we use convex relaxations based on the sum-of-squares 
method. Our recovery algorithm proceeds by rounding a degree-4 sum-of-squares
relaxation of the maximum-likelihood-estimation problem for the statistical model. To
complement our algorithmic results, we show that degree-4 sum-of-squares relaxations
break down for $\tau \leq O(n^{3/4} / \log(n)^{1/4})$, which demonstrates that improving our current
guarantees (by more than logarithmic factors) would require new techniques or might 
even be intractable. 

Finally, we show how to exploit additional problem structure in order to solve our 
sum-of-squares relaxations, up to some approximation, very efficiently. Our fastest 
algorithm runs in nearly-linear time using shifted (matrix) power iteration and has 
similar guarantees as above. The analysis of this algorithm also confirms a variant of a 
conjecture of Montanari and Richard about singular vectors of tensor unfoldings. 
1 Introduction


Principal component analysis (PCA), the process of identifying a direction of largest possible
variance from a matrix of pairwise correlations, is among the most basic tools for data
analysis in a wide range of disciplines. In recent years, variants of PCA have been proposed
that promise to give better statistical guarantees for many applications. These variants
include restricting directions to the nonnegative orthant (nonnegative matrix factorization)
or to directions that are sparse linear combinations of a fixed basis (sparse PCA). Often
we have access to not only pairwise but also higher-order correlations. In this case, an
analog of PCA is to find a direction with largest possible third moment or other higher-order
moment (higher-order PCA or tensor PCA).

All of these variants of PCA share that the underlying optimization problem is NP-hard
for general instances (often even if we allow approximation), whereas vanilla PCA boils
down to an efficient eigenvector computation for the input matrix. However, these
hardness results are not predictive in statistical settings where inputs are drawn from
particular families of distributions. Here efficient algorithms can often achieve much
stronger guarantees than for general instances. Understanding the power and limitations
of efficient algorithms for statistical models of NP-hard optimization problems is typically
very challenging: it is not clear what kind of algorithms can exploit the additional structure
afforded by statistical instances, but, at the same time, there are very few tools for reasoning
about the computational complexity of statistical / average-case problems. (See [BR13]
and [BKS13] for discussions about the computational complexity of statistical models for
sparse PCA and random constraint satisfaction problems.)

We study a statistical model for the tensor principal component analysis problem introduced 
by [MR14] through the lens of a meta-algorithm called the sum-of-squares method, based 
on semidefinite programming. This method can capture a wide range of algorithmic 
techniques including linear programming and spectral algorithms. We show that this 
method can exploit the structure of statistical tensor PCA instances in non-trivial ways and
achieves guarantees that improve over the previous ones. On the other hand, we show 
that those guarantees are nearly tight if we restrict the complexity of the sum-of-squares 
meta-algorithm at a particular level. This result rules out better guarantees for a fairly 
wide range of potential algorithms. Finally, we develop techniques to turn algorithms 
based on the sum-of-squares meta-algorithm into algorithms that are truly efficient (and 
even easy to implement). 

Montanari and Richard propose the following statistical model¹ for tensor PCA.

Problem 1.1 (Spiked Tensor Model for tensor PCA, Asymmetric). Given an input tensor
$T = \tau \cdot v^{\otimes 3} + A$, where $v \in \mathbb{R}^n$ is an arbitrary unit vector, $\tau \geq 0$ is the
signal-to-noise ratio, and $A$ is a random noise tensor with iid standard Gaussian entries,
recover the signal $v$ approximately.
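To make Problem 1.1 concrete, the following is a minimal NumPy sketch of drawing one instance and evaluating the objective at the planted vector. The function name `spiked_tensor`, the dimension `n = 30`, and the choice `tau = n` are illustrative assumptions, not part of the model's definition.

```python
import numpy as np

def spiked_tensor(n, tau, rng):
    """Draw T = tau * v^{(x)3} + A with iid N(0, 1) noise (asymmetric model)."""
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)                        # an arbitrary unit vector
    signal = tau * np.einsum("i,j,k->ijk", v, v, v)
    noise = rng.standard_normal((n, n, n))        # iid standard Gaussian entries
    return signal + noise, v

rng = np.random.default_rng(0)
n = 30
tau = float(n)                                    # illustrative choice only
T, v = spiked_tensor(n, tau, rng)

# Objective sum_ijk T_ijk v_i v_j v_k evaluated at the planted vector.
value_at_v = np.einsum("ijk,i,j,k->", T, v, v, v)
```

Because the noise contributes a single standard Gaussian to the value at the fixed vector `v`, `value_at_v` should land within a few units of `tau`.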

1 Montanari and Richard use a different normalization for the signal-to-noise ratio. In their
notation, the signal-to-noise parameter is approximately $\tau / \sqrt{n}$.
Montanari and Richard show that when $\tau \leq o(\sqrt{n})$ Problem 1.1 becomes
information-theoretically unsolvable, while for $\tau \geq \omega(\sqrt{n})$ the maximum likelihood
estimator (MLE) recovers $v'$ with $\langle v, v' \rangle \geq 1 - o(1)$.

The maximum-likelihood-estimator (MLE) problem for Problem 1.1 is an instance of
the following meta-problem for $k = 3$ and $f : x \mapsto \sum_{ijk} T_{ijk} x_i x_j x_k$ [MR14].

Problem 1.2. Given a homogeneous, degree-$k$ function $f$ on $\mathbb{R}^n$, find a unit vector
$v \in \mathbb{R}^n$ so as to maximize $f(v)$ approximately.

For k = 2, this problem is just an eigenvector computation. Already for k = 3, it is 
NP-hard. Our algorithms proceed by relaxing Problem 1.2 to a convex problem. The 
latter can be solved either exactly or approximately (as will be the case of our faster 
algorithms). Under the Gaussian assumption on the noise in Problem 1.1, we show that 
for $\tau \geq \omega(n^{3/4} \log(n)^{1/4})$ the relaxation does not substantially change the global optimum.

Noise Symmetry. Montanari and Richard actually consider two variants of this model. 
The first we have already described. In the second, the noise is symmetrized (to match the
symmetry of potential signal tensors $v^{\otimes 3}$).

Problem 1.3 (Spiked Tensor Model for tensor PCA, Symmetric). Given an input tensor
$T = \tau \cdot v^{\otimes 3} + A$, where $v \in \mathbb{R}^n$ is an arbitrary unit vector, $\tau \geq 0$ is the
signal-to-noise ratio, and $A$ is a random symmetric noise tensor (that is,
$A_{ijk} = A_{\pi(i)\pi(j)\pi(k)}$ for any permutation $\pi$) with otherwise iid standard Gaussian
entries, recover the signal $v$ approximately.

It turns out that for our algorithms based on the sum-of-squares method, this kind 
of symmetrization is already built-in. Hence there is no difference between Problem 1.1 
and Problem 1.3 for those algorithms. For our faster algorithms, such symmetrization is 
not built in. Nonetheless, we show that a variant of our nearly-linear-time algorithm for 
Problem 1.1 also solves Problem 1.3 with matching guarantees. 

1.1 Results 

Sum-of-squares relaxation. We consider the degree-4 sum-of-squares relaxation for the 
MLE problem. (See Section 1.2 for a brief discussion about sum-of-squares. All necessary 
definitions are in Section 2. See [BS14] for more detailed discussion.) Note that the planted
vector $v$ has objective value $(1 - o(1))\tau$ for the MLE problem with high probability (assuming
$\tau = \Omega(\sqrt{n})$, which will always be the case for us).

Theorem 1.4. There exists a polynomial-time algorithm based on the degree-4 sum-of-squares
relaxation for the MLE problem that given an instance of Problem 1.1 or Problem 1.3 with
$\tau \geq n^{3/4} (\log n)^{1/4} / \varepsilon$ outputs a unit vector $v'$ with $\langle v, v' \rangle \geq 1 - O(\varepsilon)$ with
probability $1 - O(n^{-10})$ over the randomness in the input. Furthermore, the algorithm works by
rounding any solution to the relaxation with objective value at least $(1 - o(1))\tau$. Finally, the
algorithm also certifies that all unit vectors bounded away from $v'$ have objective value
significantly smaller than $\tau$ for the MLE problem Problem 1.2.
We complement the above algorithmic result by the following lower bound.

Theorem 1.5 (Informal Version). There is $\tau : \mathbb{N} \to \mathbb{R}$ with $\tau \leq O(n^{3/4} / \log(n)^{1/4})$ so that when
$T$ is an instance of Problem 1.1 with signal-to-noise ratio $\tau$, with probability $1 - O(n^{-10})$, there
exists a solution to the degree-4 sum-of-squares relaxation for the MLE problem with objective value
at least $\tau$ that does not depend on the planted vector $v$. In particular, no algorithm can reliably
recover from this solution a vector $v'$ that is significantly correlated with $v$.

Faster algorithms. We interpret a tensor-unfolding algorithm studied by Montanari and 
Richard as a spectral relaxation of the degree-4 sum-of-squares program for the MLE 
problem. This interpretation leads to an analysis that gives better guarantees in terms of 
signal-to-noise ratio $\tau$ and also informs a more efficient implementation based on shifted
matrix power iteration. 

Theorem 1.6. There exists an algorithm with running time $\tilde{O}(n^3)$, which is linear in the size of
the input, that given an instance of Problem 1.1 or Problem 1.3 with $\tau \geq n^{3/4} / \varepsilon$ outputs with
probability $1 - O(n^{-10})$ a unit vector $v'$ with $\langle v, v' \rangle \geq 1 - O(\varepsilon)$.

We remark that unlike the previous polynomial-time algorithm this linear time algorithm 
does not come with a certification guarantee. In Section 4.1, we show that small adversarial 
perturbations can cause this algorithm to fail, whereas the previous algorithm is robust 
against such perturbations. We also devise an algorithm with the certification property 
and running time $\tilde{O}(n^4)$ (which is subquadratic in the size $n^3$ of the input).

Theorem 1.7. There exists an algorithm with running time $\tilde{O}(n^4)$ (for inputs of size $n^3$) that
given an instance of Problem 1.1 with $\tau \geq n^{3/4} (\log n)^{1/4} / \varepsilon$ for some $\varepsilon$, outputs with probability
$1 - O(n^{-10})$ a unit vector $v'$ with $\langle v, v' \rangle \geq 1 - O(\varepsilon)$ and certifies that all vectors bounded away from
$v'$ have MLE objective value significantly less than $\tau$.

Higher-order tensors. Our algorithmic results also extend in a straightforward way to 
tensors of order higher than 3. (See Section 7 for some details.) For simplicity we give 
some of these results only for the higher-order analogue of Problem 1.1; we conjecture 
however that all our results for Problem 1.3 generalize in similar fashion. 

Theorem 1.8. Let $k$ be an odd integer, $v_0 \in \mathbb{R}^n$ a unit vector, $\tau \geq n^{k/4} \log(n)^{1/4} / \varepsilon$, and $A$ an
order-$k$ tensor with independent unit Gaussian entries. Let $T(x) = \tau \cdot \langle v_0, x \rangle^k + A(x)$.

1. There is a polynomial-time algorithm, based on semidefinite programming, which on input
$T(x) = \tau \cdot \langle v_0, x \rangle^k + A(x)$ returns a unit vector $v$ with $\langle v_0, v \rangle \geq 1 - O(\varepsilon)$ with probability
$1 - O(n^{-10})$ over random choice of $A$.

2. There is a polynomial-time algorithm, based on semidefinite programming, which on input
$T(x) = \tau \cdot \langle v_0, x \rangle^k + A(x)$ certifies that $T(x) \leq \tau \cdot \langle v, x \rangle^k + O(n^{k/4} \log(n)^{1/4})$ for some unit $v$
with probability $1 - O(n^{-10})$ over random choice of $A$. This guarantees in particular that $v$ is
close to a maximum likelihood estimator for the problem of recovering the signal $v_0$ from the
input $\tau \cdot v_0^{\otimes k} + A$.
3. By solving the semidefinite relaxation approximately, both algorithms can be implemented in
time $\tilde{O}(m^{1+1/k})$, where $m = n^k$ is the input size.

For even $k$, the above all hold, except now we recover $v$ with $\langle v_0, v \rangle^2 \geq 1 - O(\varepsilon)$, and the algorithms
can be implemented in nearly linear time.

Remark 1.9. When A is a symmetric noise tensor (the higher-order analogue of Problem 1.3), 
(1-2) above hold. We conjecture that (3) does as well. 

The last theorem, the higher-order generalization of Theorem 1.6, almost completely 
resolves a conjecture of Montanari and Richard regarding tensor unfolding algorithms for 
odd $k$. We are able to prove their conjectured signal-to-noise ratio $\tau$ for an algorithm that
works mainly by using an unfolding of the input tensor, but our algorithm includes an 
extra random-rotation step to handle sparse signals. We conjecture but cannot prove that 
the necessity of this step is an artifact of the analysis. 

Theorem 1.10. Let $k$ be an odd integer, $v_0 \in \mathbb{R}^n$ a unit vector, $\tau \geq n^{k/4} / \varepsilon$, and $A$ an order-$k$ tensor
with independent unit Gaussian entries. There is a nearly-linear-time algorithm, based on tensor
unfolding, which, with probability $1 - O(n^{-10})$ over random choice of $A$, recovers a vector $v$ with
$\langle v, v_0 \rangle^2 \geq 1 - O(\varepsilon)$. This continues to hold when $A$ is replaced by a symmetric noise tensor (the
higher-order analogue of Problem 1.3).

1.2 Techniques 

We arrive at our results via an analysis of Problem 1.2 for the function $f(x) = \sum_{ijk} T_{ijk} x_i x_j x_k$,
where $T$ is an instance of Problem 1.1. The function $f$ decomposes as $f = g + h$ for a signal
$g(x) = \tau \cdot \langle v, x \rangle^3$ and noise $h(x) = \sum_{ijk} a_{ijk} x_i x_j x_k$ where $\{a_{ijk}\}$ are iid standard Gaussians. The
signal $g$ is maximized at $x = v$, where it takes the value $\tau$. The noise part, $h$, is with high
probability at most $\tilde{O}(\sqrt{n})$ over the unit sphere. We have insisted that $\tau$ be much greater
than $\sqrt{n}$, so $f$ has a unique global maximum, dominated by the signal $g$. The main problem
is to find it.

To maximize $f$, we apply the Sum-of-Squares meta-algorithm (SoS). SoS provides a
hierarchy of strong convex relaxations of Problem 1.2. Using convex duality, we can recast
the optimization problem as one of efficiently certifying the upper bound on $h$ which shows
that optima of $f$ are dominated by the signal. SoS efficiently finds boundedness certificates
for h of the form 

$$c - h(x) = s_1(x)^2 + \cdots + s_k(x)^2$$

where "=" denotes equality in the ring $\mathbb{R}[x] / (\|x\|^2 - 1)$ and where $s_1, \ldots, s_k$ have bounded
degree, when such certificates exist. (The polynomials $\{s_i\}$ certify that $h(x) \leq c$.
Otherwise $c - h(x)$ would be negative, but this is impossible by the nonnegativity of squared
polynomials.)
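A toy instance may help fix the idea of such a certificate. The polynomial $h(x) = x_1 x_2$ on the unit circle and the constant $c = 1/2$ are illustrative choices made here, not objects from the analysis; the example is quadratic rather than cubic so that a single square suffices.

```latex
\begin{align*}
h(x) &= x_1 x_2, \qquad c = \tfrac12, \\
c - h(x) &= \tfrac12 - x_1 x_2 \\
         &\equiv \tfrac12\,(x_1^2 + x_2^2) - x_1 x_2 \pmod{\|x\|^2 - 1} \\
         &= \Bigl(\tfrac{1}{\sqrt2}\,(x_1 - x_2)\Bigr)^2 .
\end{align*}
```

Here a single polynomial $s_1(x) = (x_1 - x_2)/\sqrt{2}$ of degree 1 certifies $h(x) \leq 1/2$ on the unit circle, and the bound is attained at $x = (1/\sqrt{2}, 1/\sqrt{2})$.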

Our main technical contribution is an almost-complete characterization of certificates 
like these for such degree-3 random polynomials $h$ when the polynomials $\{s_i\}$ have degree
at most four. In particular, we show that with high probability in the random case a
degree-4 certificate exists for $c = \tilde{O}(n^{3/4})$, and that with high probability, no significantly
better degree-four certificate exists.
Algorithms. We apply this characterization in three ways to obtain three different
algorithms. The first application is a polynomial-time algorithm based on semidefinite
programming that maximizes $f$ when $\tau \geq \tilde{\Omega}(n^{3/4})$ (and thus solves tensor PCA (TPCA) in
the spiked tensor model for $\tau \geq \tilde{\Omega}(n^{3/4})$). This first algorithm involves solving a large
semidefinite program associated to the SoS relaxation. As a second application of this
characterization, we avoid solving the semidefinite program. Instead, we give an algorithm
running in time $\tilde{O}(n^4)$ which quickly constructs only a small portion of an almost-optimal SoS
boundedness certificate; in the random case this turns out to be enough to find the signal $v$
and certify the boundedness of $g$. (Note that this running time is only a factor of
$n\,\mathrm{polylog}\,n$ greater than the input size $n^3$.)

Finally, we analyze a third algorithm for TPCA which simply computes the highest 
singular vector of a matrix unfolding of the input tensor. This algorithm was considered 
in depth by Montanari and Richard, who fully characterized its behavior in the case of 
even-order tensors (corresponding to k = 4,6,8,... in Problem 1.2). They conjectured that 
this algorithm successfully recovers the signal $v$ at the signal-to-noise ratio $\tau$ of Theorem 1.7
for Problem 1.1 and Problem 1.3. Up to an extra random-rotation step before the tensor
unfolding in the case that the input comes from Problem 1.3 (and up to logarithmic factors
in $\tau$) we confirm their conjecture. We observe that their algorithm can be viewed as a
method of rounding a non-optimal solution to the SoS relaxation to find the signal. We
show, also, that for $k = 4$, the degree-4 SoS relaxation does no better than the simpler tensor
unfolding algorithm as far as signal-to-noise ratio is concerned. However, for odd-order
tensors this unfolding algorithm does not certify its own success in the way our other 
algorithms do. 
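The basic unfolding idea can be sketched in a few lines of NumPy. This is only a sketch under stated assumptions: it uses a dense SVD from `np.linalg.svd` rather than the shifted power iteration of the fast algorithm, it omits the random-rotation step, and the function name `recover_by_unfolding`, the dimension `n = 40`, and the constant 5 in `tau` are illustrative choices.

```python
import numpy as np

def recover_by_unfolding(T):
    """Top left singular vector of the n x n^2 unfolding of a 3-tensor."""
    n = T.shape[0]
    unfolded = T.reshape(n, n * n)                 # rows indexed by the first mode
    U, _, _ = np.linalg.svd(unfolded, full_matrices=False)
    return U[:, 0]

rng = np.random.default_rng(1)
n = 40
tau = 5 * n ** 0.75                                # a constant multiple of n^{3/4}

v = rng.standard_normal(n)
v /= np.linalg.norm(v)
T = tau * np.einsum("i,j,k->ijk", v, v, v) + rng.standard_normal((n, n, n))

v_hat = recover_by_unfolding(T)
correlation = abs(v_hat @ v)                       # the sign of v_hat is arbitrary
```

The absolute value is needed because a singular vector is only determined up to sign, so this sketch recovers $v$ in the sense of $\langle v, \hat{v} \rangle^2$ being close to 1.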

Lower Bounds. In Theorem 1.5, we show that degree-4 SoS cannot certify that the noise
polynomial $A(x) = \sum_{ijk} a_{ijk} x_i x_j x_k$ for iid standard Gaussians $a_{ijk}$ satisfies $A(x) \leq o(n^{3/4})$.

To show that SoS certificates do not exist we construct a corresponding dual object. Here
the dual object is a degree-4 pseudo-expectation: a linear map $\tilde{\mathbb{E}} : \mathbb{R}[x]_{\leq 4} \to \mathbb{R}$ pretending to
give the expected value of polynomials of degree at most 4 under some distribution on
the unit sphere. "Pretending" here means that, just like an actual distribution, $\tilde{\mathbb{E}}\, p(x)^2 \geq 0$
for any $p$ of degree at most 2 (so that $p(x)^2$ has degree at most 4). In other words, $\tilde{\mathbb{E}}$ is
positive semidefinite on degree 4 polynomials. While for any actual distribution over the unit
sphere $\mathbb{E}\, A(x) \leq \tilde{O}(\sqrt{n})$, we will give $\tilde{\mathbb{E}}$ so that $\tilde{\mathbb{E}}\, A(x) \geq \tilde{\Omega}(n^{3/4})$.

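To ensure that $\tilde{\mathbb{E}}\, A(x) \geq \tilde{\Omega}(n^{3/4})$, for monomials $x_i x_j x_k$ of degree 3 we take
$\tilde{\mathbb{E}}\, x_i x_j x_k \approx \frac{n^{3/4}}{\langle A, A \rangle} a_{ijk}$. For polynomials $p$ of degree at most 2 it turns out to be
enough to set $\tilde{\mathbb{E}}\, p(x) \approx \mathbb{E}^{\mu} p(x)$, where $\mathbb{E}^{\mu}$ denotes the expectation under the uniform
distribution on the unit sphere.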

Having guessed these degree 1, 2 and 3 pseudo-moments, we need to define $\tilde{\mathbb{E}}\, x_i x_j x_k x_\ell$
so that $\tilde{\mathbb{E}}$ is PSD. Representing $\tilde{\mathbb{E}}$ as a large block matrix, the Schur complement criterion for
PSDness can be viewed as a method for turning candidate degree 1-3 moments (which here
lie on upper-left and off-diagonal blocks) into a candidate matrix $M \in \mathbb{R}^{n^2 \times n^2}$ of degree-4
pseudo-expectation values which, if used to fill out the degree-4 part of $\tilde{\mathbb{E}}$, would make it
PSD.

We would thus like to set $\tilde{\mathbb{E}}\, x_i x_j x_k x_\ell = M[(i,j),(k,\ell)]$. Unfortunately, these candidate
degree-4 moments $M[(i,j),(k,\ell)]$ do not satisfy commutativity; that is, we might have
$M[(i,j),(k,\ell)] \neq M[(i,k),(j,\ell)]$ (for example). But a valid pseudo-expectation must satisfy
$\tilde{\mathbb{E}}\, x_i x_j x_k x_\ell = \tilde{\mathbb{E}}\, x_i x_k x_j x_\ell$. To fix this, we average out the noncommutativity by setting
$\mathbb{D}\, x_i x_j x_k x_\ell = \frac{1}{|S_4|} \sum_{\pi \in S_4} M[(\pi(i),\pi(j)),(\pi(k),\pi(\ell))]$, where $S_4$ is the symmetric group on 4
elements.

This ensures that the candidate degree-4 pseudo-expectation $\mathbb{D}$ satisfies commutativity,
but it introduces a new problem. While the matrix $M$ from the Schur complement was
guaranteed to be PSD and even to make $\tilde{\mathbb{E}}$ PSD when used as its degree-4 part, some of the
permutations $\pi \cdot M$ given by $(\pi \cdot M)[(i,j),(k,\ell)] = M[(\pi(i),\pi(j)),(\pi(k),\pi(\ell))]$ need not even
be PSD themselves. This means that, while $\mathbb{D}$ avoids having large negative eigenvalues
(since it is correlated with $M$ from Schur complement), it will have some small negative
eigenvalues; i.e. $\mathbb{D}\, p(x)^2 < 0$ for some $p$.

For each permutation $\pi \cdot M$ we track the most negative eigenvalue $\lambda_{\min}(\pi \cdot M)$ using
matrix concentration inequalities. After averaging the permutations together to form $\mathbb{D}$
and adding this to $\tilde{\mathbb{E}}$ to give a linear functional $\tilde{\mathbb{E}} + \mathbb{D}$ on polynomials of degree at most 4,
our final task is to remove these small negative eigenvalues. For this we mix $\tilde{\mathbb{E}} + \mathbb{D}$ with $\mu$,
the uniform distribution on the unit sphere. Since $\mathbb{E}^{\mu}$ has eigenvalues bounded away from
zero, our final pseudo-expectation

$$\tilde{\mathbb{E}}'\, p(x) := \underbrace{\varepsilon \cdot \tilde{\mathbb{E}}\, p(x)}_{\text{degree 1-3 pseudo-expectations}} + \underbrace{\varepsilon \cdot \mathbb{D}\, p(x)}_{\text{degree 4 pseudo-expectations}} + \underbrace{(1 - \varepsilon) \cdot \mathbb{E}^{\mu} p(x)}_{\text{fix negative eigenvalues}}$$

is PSD for $\varepsilon$ small enough. Having tracked the magnitude of the negative eigenvalues
of $\mathbb{D}$, we are able to show that $\varepsilon$ here can be taken large enough to get $\tilde{\mathbb{E}}'\, A(x) = \tilde{\Omega}(n^{3/4})$,
which will prove Theorem 1.5.
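The phenomenon that averaging over index permutations can destroy positive semidefiniteness is easy to see numerically. The sketch below uses a hypothetical $2 \times 2$ toy matrix `B` and the rank-one PSD matrix built from it; it is not the Schur-complement matrix of the construction above, only an illustration of the same effect.

```python
import itertools
import numpy as np

n = 2
B = np.diag([1.0, -1.0])                 # toy choice with eigenvalues of mixed sign
b = B.reshape(-1)
M = np.outer(b, b)                       # rank one, hence PSD
M4 = M.reshape(n, n, n, n)               # M4[i, j, k, l] = M[(i, j), (k, l)]

# Average over all 24 permutations of the four index positions.
perms = list(itertools.permutations(range(4)))
D4 = sum(np.transpose(M4, p) for p in perms) / len(perms)
D = D4.reshape(n * n, n * n)

min_eig_M = np.linalg.eigvalsh(M).min()  # nonnegative up to rounding
min_eig_D = np.linalg.eigvalsh(D).min()  # strictly negative for this B
```

For this `B`, the single permutation that swaps the middle two indices already turns `M` into the Kronecker product of `B` with itself, which has eigenvalue $-1$; the averaged matrix `D` is fully symmetric but retains a negative eigenvalue.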

1.3 Related Work 

There is a vast literature on tensor analogues of linear algebra problems—too vast to 
attempt any survey here. Tensor methods for machine learning, in particular for learning 
latent variable models, have garnered recent attention, e.g., with works of Anandkumar 
et al. [AGH+14, AGHK13]. These approaches generally involve decomposing a tensor
which captures some aggregate statistics of input data into rank-one components. A recent 
series of papers analyzes the tensor power method, a direct analogue of the matrix power 
method, as a way to find rank-one components of random-case tensors [AGJ14b, AGJ14a]. 

Another recent line of work applies the Sum of Squares (a.k.a. Lasserre or
Lasserre/Parrilo) hierarchy of convex relaxations to learning problems. See the survey
of Barak and Steurer for references and discussion of these relaxations [BS14]. Barak,
Kelner, and Steurer show how to use SoS to efficiently find sparse vectors planted in 
random linear subspaces, and the same authors give an algorithm for dictionary learning 
with strong provable statistical guarantees [BKS14b, BKS14a]. These algorithms, too, 
proceed by decomposition of an underlying random tensor; they exploit the strong (in 
many cases, the strongest-known) algorithmic guarantees offered by SoS for this problem 
in a variety of average-case settings. 
Concurrently and independently of us, and also inspired by the recently-discovered
applicability of tensor and sum-of-squares methods to machine learning, Barak and Moitra 
use SoS techniques formally related to ours to address the tensor prediction problem: given 
a low-rank tensor (perhaps measured with noise) only a subset of whose entries are 
revealed, predict the rest of the tensor entries [BM15]. They work with worst-case noise 
and study the number of revealed entries necessary for the SoS hierarchy to successfully 
predict the tensor. By contrast, in our setting, the entire tensor is revealed, and we
study the signal-to-noise threshold necessary for SoS to recover its principal component 
under distributional assumptions on the noise that allow us to avoid worst-case hardness 
behavior. 

Since Barak and Moitra work in a setting where few tensor entries are revealed, they 
are able to use algorithmic techniques and lower bounds from the study of sparse random 
constraint satisfaction problems (CSPs), in particular random 3XOR [GK01, FGK05, FO07, 
FKO06]. The tensors we study are much denser. In spite of the density (and even though 
our setting is real-valued), our algorithmic techniques are related to the same spectral 
refutations of random CSPs. However our lower bound techniques do not seem to be 
related to the proof-complexity techniques that go into sum-of-squares lower bound results 
for random CSPs. 

The analysis of tractable tensor decomposition in the rank one plus noise model that 
we consider here (the spiked tensor model) was initiated by Montanari and Richard, whose
work inspired the current paper [MR14]. They analyze a number of natural algorithms and
find that tensor unfolding algorithms, which use the spectrum of a matrix unfolding of the 
input tensor, are most robust to noise. Here we consider more powerful convex relaxations, 
and in the process we tighten Montanari and Richard's analysis of tensor unfolding in the 
case of odd-order tensors. In concurrent and independent work, Zheng and Tomioka also 
give a tight analysis of tensor unfolding for the asymmetric version of the spiked model of 
tensor PCA (Problem 1.1) [ZT15, Theorem 1].

Related to our lower bound, Montanari, Reichman, and Zeitouni (MRZ) prove strong 
impossibility results for the problem of detecting rank-one perturbations of Gaussian 
matrices and tensors using any eigenvalue of the matrix or unfolded tensor; they are able 
to characterize the precise threshold below which the entire spectrum of a perturbed noise 
matrix or unfolded tensor becomes indistinguishable from pure noise [MRZ14]. This lower
bound is incomparable to our lower bound for the degree-4 SoS relaxation. The MRZ lower 
bound considers fine-grained information about the spectrum of a single matrix associated 
with the detection problem. Our lower bound considers coarser information (just the top 
eigenvalue) but it applies to a wide range of matrices associated with the problem (all 
matrices generated via the degree-4 sum-of-squares proof system). 
2 Preliminaries


2.1 Notation 

We use $x = (x_1, \ldots, x_n)$ to denote a vector of indeterminates. The letters $u, v, w$ are generally
reserved for real vectors. The letters $\alpha, \beta$ are reserved for multi-indices; that is, for tuples
$(i_1, \ldots, i_k)$ of indices. For $f, g : \mathbb{N} \to \mathbb{R}$ we write $f \lesssim g$ for $f = O(g)$ and $f \gtrsim g$ for $f = \Omega(g)$.
We write $f = \tilde{O}(g)$ if $f(n) \leq g(n) \cdot \mathrm{polylog}\, n$, and $f = \tilde{\Omega}(g)$ if $f(n) \geq g(n) / \mathrm{polylog}\, n$.

We employ the usual Loewner (a.k.a. positive semi-definite) ordering $\succeq$ on Hermitian
matrices.

We will be heavily concerned with tensors and matrix flattenings thereof. In general,
boldface capital letters $\mathbf{T}$ denote tensors and ordinary capital letters denote matrices $A$.
We adopt the convention that unless otherwise noted for a tensor $\mathbf{T}$ the matrix $T$ is the
squarest-possible unfolding of $\mathbf{T}$. If $\mathbf{T}$ has even order $k$ then $T$ has dimensions
$n^{k/2} \times n^{k/2}$. For odd $k$ it has dimensions $n^{\lfloor k/2 \rfloor} \times n^{\lceil k/2 \rceil}$. All tensors, matrices,
vectors, and scalars in this paper are real.

We use $\langle \cdot, \cdot \rangle$ to denote the usual entrywise inner product of vectors, matrices, and
tensors. For a vector $v$, we use $\|v\|$ to denote its $\ell_2$ norm. For a matrix $A$, we use $\|A\|$ to
denote its operator norm (also known as the spectral or $\ell_2$-to-$\ell_2$ norm).

For a $k$-tensor $T$, we write $T(v)$ for $\langle v^{\otimes k}, T \rangle$. Thus, $T(x)$ is a homogeneous real polynomial
of degree $k$.
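For a 3-tensor, the notation $T(v) = \langle v^{\otimes 3}, T \rangle$ can be evaluated in several equivalent ways. The NumPy sketch below (random `T` and `v` of an arbitrarily chosen dimension `n = 5`) compares a direct contraction, the entrywise inner product with $v^{\otimes 3}$, and a bilinear form through the $n \times n^2$ unfolding.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
T = rng.standard_normal((n, n, n))
v = rng.standard_normal(n)

# Direct contraction: sum_ijk T_ijk v_i v_j v_k.
direct = np.einsum("ijk,i,j,k->", T, v, v, v)

# Entrywise inner product with the rank-one tensor v (x) v (x) v.
v3 = np.einsum("i,j,k->ijk", v, v, v)
inner = np.sum(v3 * T)

# Through the n x n^2 unfolding: <v, T_unfolded (v (x) v)>.
unfolded = T.reshape(n, n * n)
via_unfolding = v @ unfolded @ np.kron(v, v)

# Homogeneity of degree 3: T(2v) = 8 T(v).
doubled = np.einsum("ijk,i,j,k->", T, 2 * v, 2 * v, 2 * v)
```

The row-major `reshape` pairs column index `j * n + k` with entry `T[i, j, k]`, which matches the ordering of `np.kron(v, v)`.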

We use $S_k$ to denote the symmetric group on $k$ elements. For a $k$-tensor $T$ and $\pi \in S_k$,
we denote by $T^{\pi}$ the $k$-tensor with indices permuted according to $\pi$, so that
$T^{\pi}_{\alpha} = T_{\pi^{-1}(\alpha)}$. A tensor $T$ is symmetric if for all $\pi \in S_k$ it is the case that
$T^{\pi} = T$. (Such tensors are sometimes called "supersymmetric.")

For clarity, most of our presentation focuses on 3-tensors. For an $n \times n \times n$ 3-tensor $T$, we
use $T_i$ to denote its $n \times n$ matrix slices along the first mode, i.e., $(T_i)_{j,k} = T_{i,j,k}$.

We often say that a sequence $\{E_n\}_{n \in \mathbb{N}}$ of events occurs with high probability, which
for us means that $\mathbb{P}(E_n \text{ fails}) = O(n^{-10})$. (Any other $n^{-c}$ would do, with appropriate
modifications of constants elsewhere.)

2.2 Polynomials and Matrices 

Let $\mathbb{R}[x]_{\leq d}$ be the vector space of polynomials with real coefficients in variables
$x = (x_1, \ldots, x_n)$, of degree at most $d$. We can represent a homogeneous even-degree polynomial
$p \in \mathbb{R}[x]_d$ by an $n^{d/2} \times n^{d/2}$ matrix: a matrix $M$ is a matrix representation for $p$ if
$p(x) = \langle x^{\otimes d/2}, M x^{\otimes d/2} \rangle$. If $p$ has a matrix representation $M \succeq 0$, then
$p = \sum_i p_i(x)^2$ for some polynomials $p_i$.
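Matrix representations are not unique, which matters later when a representation of $\|x\|^4$ is chosen freely. As a small numerical illustration (the dimension `n = 3` and the variable names are arbitrary choices), both the identity and the rank-one matrix built from the vectorized identity represent $\|x\|^4$, and an eigendecomposition of a PSD representation yields an explicit sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
x = rng.standard_normal(n)
x2 = np.kron(x, x)                      # x^{(x)2}, entries x_j * x_k

norm4 = np.dot(x, x) ** 2               # ||x||^4

# Two different matrix representations of ||x||^4.
M_id = np.eye(n * n)                    # <x2, x2> = ||x||^4
e = np.eye(n).reshape(-1)               # vectorized identity matrix
M_rank1 = np.outer(e, e)                # <x2, e>^2 = (sum_j x_j^2)^2

val_id = x2 @ M_id @ x2
val_rank1 = x2 @ M_rank1 @ x2

# A PSD representation gives a sum of squares: p(x) = sum_i lam_i * <u_i, x2>^2.
eigvals, eigvecs = np.linalg.eigh(M_rank1)
sum_of_squares = sum(lam * (u @ x2) ** 2 for lam, u in zip(eigvals, eigvecs.T))
```

Any convex combination of `M_id` and `M_rank1` is again a representation of $\|x\|^4$.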

2.3 The Sum of Squares (SoS) Algorithm 

Definition 2.1. Let $\mathcal{L} : \mathbb{R}[x]_{\leq d} \to \mathbb{R}$ be a linear functional on polynomials of degree at most
$d$ for some $d$ even. Suppose that

• $\mathcal{L}\, 1 = 1$.

• $\mathcal{L}\, p(x)^2 \geq 0$ for all $p \in \mathbb{R}[x]_{\leq d/2}$.

Then $\mathcal{L}$ is a degree-$d$ pseudo-expectation. We often use the suggestive notation $\tilde{\mathbb{E}}$ for such
a functional, and think of $\tilde{\mathbb{E}}\, p(x)$ as giving the expectation of the polynomial $p(x)$ under a
pseudo-distribution over $\{x\}$.

For $p \in \mathbb{R}[x]_{\leq d}$ we say that the pseudo-distribution $\{x\}$ (or, equivalently, the functional
$\tilde{\mathbb{E}}$) satisfies $\{p(x) = 0\}$ if $\tilde{\mathbb{E}}\, p(x) q(x) = 0$ for all $q(x)$ such that $p(x) q(x) \in \mathbb{R}[x]_{\leq d}$.
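Every actual probability distribution gives a pseudo-expectation, which is a convenient sanity check on the definition. The sketch below builds the degree-2 moment matrix of the empirical distribution on a few random points of the unit sphere (the sample size `m = 50` and dimension `n = 3` are arbitrary) and checks the two conditions of Definition 2.1 together with the constraint $\|x\|^2 - 1 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 50
points = rng.standard_normal((m, n))
points /= np.linalg.norm(points, axis=1, keepdims=True)   # m points on the unit sphere

# Monomial basis (1, x_1, ..., x_n) evaluated at each point.
basis = np.hstack([np.ones((m, 1)), points])

# Moment matrix: moments[a, b] = average of basis_a(x) * basis_b(x) over the points.
moments = basis.T @ basis / m

def L_of_square(coeffs):
    """L p(x)^2 for the degree-1 polynomial p = coeffs[0] + sum_i coeffs[i] * x_i."""
    return coeffs @ moments @ coeffs

normalization = moments[0, 0]                      # L 1
second_moment_trace = np.trace(moments[1:, 1:])    # L ||x||^2
min_eig = np.linalg.eigvalsh(moments).min()        # PSD means L p^2 >= 0 for all p
```

The interesting pseudo-expectations are those that do not come from any distribution; this sketch only shows that the definition contains the ordinary case.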

Pseudo-distributions were first introduced in [BBH+12] and are surveyed in [BS14].

We employ the standard result that, up to negligible issues of numerical accuracy, if
there exists a degree-$d$ pseudo-distribution satisfying constraints $\{p_0(x) = 0, \ldots, p_m(x) = 0\}$,
then it can be found in time $n^{O(d)}$ by solving a semidefinite program of size $n^{O(d)}$. (See
[BS14] for references.)

3 Certifying Bounds on Random Polynomials 

Let $f \in \mathbb{R}[x]_d$ be a homogeneous degree-$d$ polynomial. When $d$ is even, $f$ has square matrix
representations of dimension $n^{d/2} \times n^{d/2}$. The maximal eigenvalue of a matrix representation
$M$ of $f$ provides a natural certifiable upper bound on $\max_{\|v\|=1} f(v)$, as

$$f(v) = \langle v^{\otimes d/2}, M v^{\otimes d/2} \rangle \leq \max_{w \in \mathbb{R}^{n^{d/2}}} \frac{\langle w, M w \rangle}{\langle w, w \rangle} \leq \|M\|.$$

When $f(x) = A(x)$ for an even-order tensor $A$ with independent random entries, the quality
of this certificate is well characterized by random matrix theory. In the case where the
entries of $A$ are standard Gaussians, for instance, $\|M\| = \|A + A^T\| \leq \tilde{O}(n^{d/4})$ with high
probability, thus certifying that $\max_{\|v\|=1} f(v) \leq \tilde{O}(n^{d/4})$.

A similar story applies to $f$ of odd degree with random coefficients, but with a catch:
the certificates are not as good. For example, we expect a degree-3 random polynomial
to be a smaller and simpler object than one of degree-4, and so we should be able to
certify a tighter upper bound on $\max_{\|v\|=1} f(v)$. The matrix representations of $f$ are now
rectangular $n^2 \times n$ matrices whose top singular values are certifiable upper bounds on
$\max_{\|v\|=1} f(v)$. But in random matrix theory, this maximum singular value depends (to
a first approximation) only on the longer dimension $n^2$, which is the same here as in
the degree-4 case. Again when $f(x) = A(x)$, this time where $A$ is an order-3 tensor of
independent standard Gaussian entries, $\|M\| = \sqrt{\|A A^T\|} \geq \tilde{\Omega}(n)$, so that this method cannot
certify better than $\max_{\|v\|=1} f(v) \leq \tilde{O}(n)$. Thus, the natural spectral certificates are unable to
exploit the decrease in degree from 4 to 3 to improve the certified bounds.
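The size of the naive spectral certificate is easy to observe numerically. The following sketch (with an arbitrary dimension `n = 20` and an arbitrary number of sampled directions) computes the top singular value of the $n^2 \times n$ representation of a Gaussian 3-tensor and compares it with values of $A(v)$ at random unit vectors; random sampling only gives a crude lower estimate of the true maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
A = rng.standard_normal((n, n, n))

# Rectangular n^2 x n representation: A(v) = <v (x) v, M v>.
M = A.reshape(n * n, n)
spectral_bound = np.linalg.norm(M, 2)            # top singular value of M

# Values of A(v) at random unit vectors (a crude lower estimate of the maximum).
V = rng.standard_normal((2000, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)
values = np.einsum("ijk,si,sj,sk->s", A, V, V, V)
best_sampled = values.max()
```

The certified value `spectral_bound` scales like $n$, matching the discussion above, and every sampled value of $A(v)$ lies below it.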

To better exploit the benefits of square matrices, we bound the maxima of degree-3
homogeneous $f$ by a degree-4 polynomial. In the case that $f$ is multi-linear, we
have the polynomial identity $f(x) = \frac{1}{3} \langle x, \nabla f(x) \rangle$. Using Cauchy-Schwarz, we then get
$f(x) \leq \frac{1}{3} \|x\| \|\nabla f(x)\|$. This inequality suggests using the degree-4 polynomial $\|\nabla f(x)\|^2$ as
a bound on $f$. Note that local optima of $f$ on the sphere occur where $\nabla f(v) \propto v$, and so
this bound is tight at local maxima. Given a random homogeneous $f$, we will associate
a degree-4 polynomial related to $\|\nabla f\|^2$ and show that this polynomial yields the best
possible degree-4 SoS-certifiable bound on $\max_{\|v\|=1} f(v)$.
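For completeness, the identity and the bound can be written out for a cubic form with coefficient array $a_{ijk}$ (the notation of Section 1.2). This is the standard Euler identity for homogeneous polynomials; the computation below does not itself use multilinearity.

```latex
\begin{align*}
f(x) &= \sum_{i,j,k} a_{ijk}\, x_i x_j x_k, \\
\partial_m f(x) &= \sum_{j,k} a_{mjk}\, x_j x_k
                 + \sum_{i,k} a_{imk}\, x_i x_k
                 + \sum_{i,j} a_{ijm}\, x_i x_j, \\
\langle x, \nabla f(x) \rangle &= \sum_m x_m\, \partial_m f(x) = 3 f(x), \\
f(x) &= \tfrac13 \langle x, \nabla f(x) \rangle
      \le \tfrac13\, \|x\|\, \|\nabla f(x)\| .
\end{align*}
```

If $\|v\| = 1$ and $\nabla f(v) = c\,v$ with $c \geq 0$, both sides of the last inequality equal $c/3$, which is the tightness claim.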

Definition 3.1. Let $f \in \mathbb{R}[x]_3$ be a homogeneous degree-3 polynomial with indeterminates
$x = (x_1, \ldots, x_n)$. Suppose $A_1, \ldots, A_n$ are matrices such that $f = \sum_i x_i \langle x, A_i x \rangle$. We say that $f$ is
$\lambda$-bounded if there are matrices $A_1, \ldots, A_n$ as above and a matrix representation $M$ of $\|x\|^4$
so that $\sum_i A_i \otimes A_i \preceq \lambda^2 \cdot M$.

We observe that for $f$ multi-linear in the coordinates $x_i$ of $x$, up to a constant factor we
may take the matrices $A_i$ to be matrix representations of $\partial_i f$, so that $\sum_i A_i \otimes A_i$ is a matrix
representation of the polynomial $\|\nabla f\|^2$. This choice of $A_i$ may not, however, yield the
optimal spectral bound $\lambda^2$.

The following theorem is the reason for our definition of $\lambda$-boundedness.

Theorem 3.2. Let $f \in \mathbb{R}[x]_3$ be $\lambda$-bounded. Then $\max_{\|v\|=1} f(v) \leq \lambda$, and the degree-4 SoS
algorithm certifies this. In particular, every degree-4 pseudo-distribution $\{x\}$ over $\mathbb{R}^n$ satisfies

$$\tilde{\mathbb{E}}\, f \leq \lambda \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4}.$$

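Proof. By Cauchy-Schwarz for pseudo-expectations, the pseudo-distribution satisfies
$(\tilde{\mathbb{E}}\|x\|^2)^2 \le \tilde{\mathbb{E}}\|x\|^4$ and $(\tilde{\mathbb{E}} \sum_i x_i \langle x, A_i x\rangle)^2 \le (\tilde{\mathbb{E}} \sum_i x_i^2) \cdot (\tilde{\mathbb{E}} \sum_i \langle x, A_i x\rangle^2)$. Therefore,

\[
\begin{aligned}
\tilde{\mathbb{E}} f &= \tilde{\mathbb{E}} \sum_i x_i \cdot \langle x, A_i x\rangle \\
&\le \Big(\tilde{\mathbb{E}} \sum_i x_i^2\Big)^{1/2} \cdot \Big(\tilde{\mathbb{E}} \sum_i \langle x, A_i x\rangle^2\Big)^{1/2} \\
&= (\tilde{\mathbb{E}} \|x\|^2)^{1/2} \cdot \Big(\tilde{\mathbb{E}} \big\langle x^{\otimes 2}, \big(\textstyle\sum_i A_i \otimes A_i\big) x^{\otimes 2}\big\rangle\Big)^{1/2} \\
&\le (\tilde{\mathbb{E}} \|x\|^4)^{1/4} \cdot \big(\tilde{\mathbb{E}} \langle x^{\otimes 2}, \lambda^2 \cdot M x^{\otimes 2}\rangle\big)^{1/2} \\
&= \lambda \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4} .
\end{aligned}
\]

The last inequality also uses the premise $\sum_i A_i \otimes A_i \preceq \lambda^2 \cdot M$ for some matrix representation
$M$ of $\|x\|^4$, in the following way. Since $M' := \lambda^2 \cdot M - \sum_i A_i \otimes A_i \succeq 0$, the polynomial
$\langle x^{\otimes 2}, M' x^{\otimes 2}\rangle$ is a sum of squared polynomials. Thus, $\tilde{\mathbb{E}} \langle x^{\otimes 2}, M' x^{\otimes 2}\rangle \ge 0$ and the desired
inequality follows. □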
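As a numerical illustration of the Cauchy-Schwarz step (not part of the original argument), the following NumPy sketch checks on actual unit vectors that $f(x) \le \|x\| \cdot \langle x^{\otimes 2}, (\sum_i A_i \otimes A_i) x^{\otimes 2}\rangle^{1/2}$. The dimension, the seed, and the names `f`, `K`, and `certificate_bound` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n, n))      # A[i] plays the role of the slice A_i

def f(x):
    # f(x) = sum_i x_i <x, A_i x>
    return np.einsum("ijk,i,j,k->", A, x, x, x)

# sum_i A_i (x) A_i: a matrix representation of sum_i <x, A_i x>^2
K = sum(np.kron(A[i], A[i]) for i in range(n))

def certificate_bound(x):
    xx = np.kron(x, x)                  # x^{(x)2} as a vector of length n^2
    quad = max(xx @ K @ xx, 0.0)        # equals sum_i <x, A_i x>^2 up to rounding
    return np.linalg.norm(x) * np.sqrt(quad)

# Compare f with the bound on random unit vectors.
xs = rng.standard_normal((200, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
gaps = np.array([certificate_bound(x) - f(x) for x in xs])
print("smallest gap (bound minus f):", gaps.min())
```

The check only covers actual distributions; the point of Theorem 3.2 is that the same inequality holds for every degree-4 pseudo-distribution.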

We now state the degree-3 case of a general $\lambda$-boundedness fact for homogeneous
polynomials with random coefficients. The SoS-certifiable bound for a random degree-3
polynomial this provides is the backbone of our SoS algorithm for tensor PCA in the spiked
tensor model.

Theorem 3.3. Let $A$ be a 3-tensor with independent entries from $N(0, 1)$. Then $A(x)$ is $\lambda$-bounded
with $\lambda = O(n^{3/4} \log(n)^{1/4})$, with high probability.

The full statement and proof of this theorem, generalized to arbitrary-degree homogeneous
polynomials, may be found as Theorem B.5; we prove the statement above as a
corollary in Section B. Here we provide a proof sketch.

Proof sketch. We first note that the matrix slices $A_i$ of $A$ satisfy $A(x) = \sum_i x_i \langle x, A_i x\rangle$. Using the
matrix Bernstein inequality, we show that $\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i \preceq O(n^{3/2} (\log n)^{1/2}) \cdot \mathrm{Id}$ with
high probability. At the same time, a straightforward computation shows that $\frac{1}{n} \mathbb{E} \sum_i A_i \otimes A_i$
is a matrix representation of $\|x\|^4$. Since $\mathrm{Id}$ is as well, we get that $\sum_i A_i \otimes A_i \preceq \lambda^2 \cdot M$,
where $M$ is some matrix representation of $\|x\|^4$ which combines $\mathrm{Id}$ and $\mathbb{E} \sum_i A_i \otimes A_i$, and
$\lambda = O(n^{3/4} (\log n)^{1/4})$. □
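The "straightforward computation" in the sketch can be checked numerically. The snippet below is an illustrative sketch under assumptions of our own (NumPy's row-major Kronecker convention, a small $n$, a fixed seed, and a Monte Carlo sample size chosen for speed). It uses the closed form $\mathbb{E} \sum_i A_i \otimes A_i = n \cdot \mathrm{vec}(\mathrm{Id})\,\mathrm{vec}(\mathrm{Id})^\top$, which follows from $\mathbb{E}\, A_i[j,l] A_i[k,m] = \delta_{jk}\delta_{lm}$.

```python
import numpy as np

n = 4
vec_id = np.eye(n).reshape(-1)                # vec(Id), a vector of length n^2
expected_sum = n * np.outer(vec_id, vec_id)   # closed form of E sum_i A_i (x) A_i

rng = np.random.default_rng(1)
x = rng.standard_normal(n)
xx = np.kron(x, x)
lhs = xx @ expected_sum @ xx                  # <x^{(x)2}, (E sum_i A_i (x) A_i) x^{(x)2}>
rhs = n * np.linalg.norm(x) ** 4              # n * ||x||^4

# Monte Carlo sanity check of the closed form for the expectation.
trials = 2000
acc = np.zeros((n * n, n * n))
for _ in range(trials):
    A = rng.standard_normal((n, n, n))
    acc += sum(np.kron(A[i], A[i]) for i in range(n))
max_dev = np.abs(acc / trials - expected_sum).max()
print(lhs, rhs, max_dev)
```

Dividing `expected_sum` by `n` therefore gives a matrix representation of $\|x\|^4$, as used in the sketch.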

Corollary 3.4. Let $A$ be a 3-tensor with independent entries from $N(0, 1)$. Then, with high
probability, the degree-4 SoS algorithm certifies that $\max_{\|v\|=1} A(v) \le O(n^{3/4} (\log n)^{1/4})$. Furthermore,
also with high probability, every pseudo-distribution $\{x\}$ over $\mathbb{R}^n$ satisfies

\[ \tilde{\mathbb{E}} A(x) \le O(n^{3/4} (\log n)^{1/4}) \, (\tilde{\mathbb{E}} \|x\|^4)^{3/4} . \]

Proof. Immediate by combining Theorem 3.3 with Theorem 3.2. □


4 Polynomial-Time Recovery via Sum of Squares 

Here we give our first algorithm for tensor PCA: we analyze the quality of the natural
SoS relaxation of tensor PCA using our previous discussion of boundedness certificates
for random polynomials, and we show how to round this relaxation. We also discuss the
robustness of the SoS-based algorithm to some amount of additional worst-case noise in the
input. For now, to obtain a solution to the SoS relaxation we will solve a large semidefinite
program. Thus, the algorithm discussed here is not yet enough to prove Theorem 1.7 and
Corollary 1.7: the running time, while still polynomial, is somewhat greater than $O(n^4)$.


Tensor PCA with Semidefinite Programming 

Input: $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$ is some order-3 tensor.

Goal: Find $v \in \mathbb{R}^n$ with $|\langle v, v_0\rangle| \ge 1 - o(1)$.

Algorithm 4.1 (Recovery). Using semidefinite programming, find the degree-4 pseudo-distribution
$\{x\}$ satisfying $\{\|x\|^2 = 1\}$ which maximizes $\tilde{\mathbb{E}}\, T(x)$. Output $\tilde{\mathbb{E}} x / \|\tilde{\mathbb{E}} x\|$.

Algorithm 4.2 (Certification). Run Algorithm 4.1 to obtain $v$. Using semidefinite programming,
find the degree-4 pseudo-distribution $\{x\}$ satisfying $\{\|x\|^2 = 1\}$ which maximizes
$\tilde{\mathbb{E}}\, T(x) - \tau \cdot \langle v, x\rangle^3$. If $\tilde{\mathbb{E}}\, T(x) - \tau \cdot \langle v, x\rangle^3 \le O(n^{3/4} \log(n)^{1/4})$, output certify. Otherwise, output
fail.

The following theorem characterizes the success of Algorithm 4.1 and Algorithm 4.2.

Theorem 4.3 (Formal version of Theorem 1.4). Let $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$
has independent entries from $N(0, 1)$. Let $\tau \ge n^{3/4} \log(n)^{1/4} / \varepsilon$. Then with high probability over
random choice of $A$, on input $T$ or $T' := \tau \cdot v_0^{\otimes 3} + \frac{1}{|S_3|} \sum_{\pi \in S_3} A^{\pi}$, Algorithm 4.1 outputs a vector
$v$ with $\langle v, v_0\rangle \ge 1 - O(\varepsilon)$. In other words, for this $\tau$, Algorithm 4.1 solves both Problem 1.1 and
Problem 1.3.

For any unit $v_0 \in \mathbb{R}^n$ and $A$, if Algorithm 4.2 outputs certify then $T(x) \le \tau \cdot \langle v, x\rangle^3 +
O(n^{3/4} \log(n)^{1/4})$. For $A$ as described in either Problem 1.1 or Problem 1.3 and $\tau \ge n^{3/4} \log(n)^{1/4} / \varepsilon$,
Algorithm 4.2 outputs certify with high probability.

The analysis has two parts. We show that

1. if there exists a sufficiently good upper bound on $A(x)$ (or in the case of the symmetric
noise input, on $A^{\pi}(x)$ for every $\pi \in S_3$) which is degree-4 SoS certifiable, then the
vector recovered by the algorithm will be very close to $v_0$, and that

2. in the case of $A$ with independent entries from $N(0, 1)$, such a bound exists with high
probability.

Conveniently, Item 2 is precisely the content of Corollary 3.4. The following lemma
expresses Item 1.

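Lemma 4.4. Suppose $A(x) \in \mathbb{R}[x]_3$ is such that $|\tilde{\mathbb{E}} A(x)| \le \varepsilon \tau \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$ for any degree-4
pseudo-distribution $\{x\}$. Then on input $\tau \cdot v_0^{\otimes 3} + A$, Algorithm 4.1 outputs a unit vector $v$ with
$\langle v, v_0\rangle \ge 1 - O(\varepsilon)$.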

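Proof. Algorithm 4.1 outputs $v = \tilde{\mathbb{E}} x / \|\tilde{\mathbb{E}} x\|$ for the pseudo-distribution that it finds, so
we'd like to show $\langle v_0, \tilde{\mathbb{E}} x / \|\tilde{\mathbb{E}} x\|\rangle \ge 1 - O(\varepsilon)$. By pseudo-Cauchy-Schwarz (Lemma A.2),
$\|\tilde{\mathbb{E}} x\|^2 \le \tilde{\mathbb{E}} \|x\|^2 = 1$, so it will suffice to prove just that $\langle v_0, \tilde{\mathbb{E}} x\rangle \ge 1 - O(\varepsilon)$.

If $\tilde{\mathbb{E}} \langle v_0, x\rangle^3 \ge 1 - O(\varepsilon)$, then by Lemma A.5 (and linearity of pseudo-expectation) we
would have

\[ \langle v_0, \tilde{\mathbb{E}} x\rangle = \tilde{\mathbb{E}} \langle v_0, x\rangle \ge 1 - O(2\varepsilon) = 1 - O(\varepsilon) . \]

So it suffices to show that $\tilde{\mathbb{E}} \langle v_0, x\rangle^3$ is close to 1.

Recall that Algorithm 4.1 finds a pseudo-distribution that maximizes $\tilde{\mathbb{E}}\, T(x)$. We split
$\tilde{\mathbb{E}}\, T(x)$ into the signal $\tilde{\mathbb{E}} \langle v_0^{\otimes 3}, x^{\otimes 3}\rangle$ and noise $\tilde{\mathbb{E}} A(x)$ components and use our hypothesized
SoS upper bound on the noise.

\[ \tilde{\mathbb{E}}\, T(x) = \tau \cdot \tilde{\mathbb{E}} \langle v_0^{\otimes 3}, x^{\otimes 3}\rangle + \tilde{\mathbb{E}} A(x) \le \tau \cdot \tilde{\mathbb{E}} \langle v_0^{\otimes 3}, x^{\otimes 3}\rangle + \varepsilon \tau . \]

Rewriting $\langle v_0^{\otimes 3}, x^{\otimes 3}\rangle$ as $\langle v_0, x\rangle^3$, we obtain

\[ \tilde{\mathbb{E}} \langle v_0, x\rangle^3 \ge \frac{1}{\tau} \cdot \tilde{\mathbb{E}}\, T(x) - \varepsilon . \]

Finally, there exists a pseudo-distribution that achieves $\tilde{\mathbb{E}}\, T(x) \ge \tau - \varepsilon \tau$. Indeed, the
trivial distribution giving probability 1 to $v_0$ is such a pseudo-distribution:

\[ T(v_0) = \tau + A(v_0) \ge \tau - \varepsilon \tau . \]

Putting it together,

\[ \tilde{\mathbb{E}} \langle v_0, x\rangle^3 \ge \frac{1}{\tau} \cdot \tilde{\mathbb{E}}\, T(x) - \varepsilon \ge \frac{(1 - \varepsilon)\tau}{\tau} - \varepsilon = 1 - O(\varepsilon) . \]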


□

Proof of Theorem 4.3. We first address Algorithm 4.1. Let $\tau$, $T$, $T'$ be as in the theorem
statement. By Lemma 4.4, it will be enough to show that with high probability every
degree-4 pseudo-distribution $\{x\}$ has $\tilde{\mathbb{E}} A(x) \le \varepsilon' \tau \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$ and $\tilde{\mathbb{E}} A^{\pi}(x) \le \varepsilon' \tau \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$
for some $\varepsilon' = O(\varepsilon)$. By Corollary 3.4 and our assumptions on $\tau$ this happens for each
permutation $A^{\pi}$ individually with high probability, so a union bound over $A^{\pi}$ for $\pi \in S_3$
completes the proof.

Turning to Algorithm 4.2, the simple fact that SoS only certifies true upper bounds
implies that the algorithm is never wrong when it outputs certify. It is not hard to see that
whenever Algorithm 4.1 has succeeded in recovering $v$ because $\tilde{\mathbb{E}} A(x)$ is bounded, which
as above happens with high probability, Algorithm 4.2 will output certify. □

4.1 Semi-Random Tensor PCA 

We discuss here a modified TPCA model, which will illustrate the qualitative differences 
between the new tensor PCA algorithms we propose in this paper and previously-known 
algorithms. The model is semi-random and semi-adversarial. Such models are often 
used in average-case complexity theory to distinguish between algorithms which work by 
solving robust maximum-likelihood-style problems and those which work by exploiting 
some more fragile property of a particular choice of input distribution. 

Problem 4.5 (Tensor PCA in the Semi-Random Model). Let $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$
and $A$ has independent entries from $N(0, 1)$. Let $Q \in \mathbb{R}^{n \times n}$ with $\|\mathrm{Id} - Q\| \le O(n^{-1/4})$, chosen
adversarially depending on $T$. Let $T'$ be the 3-tensor whose $n^2 \times n$ matrix flattening is $TQ$.
(That is, each row of $T$ has been multiplied by a matrix which is close to identity.) On input
$T'$, recover $v_0$.

Here we show that Algorithm 4.1 succeeds in recovering $v_0$ in the semi-random model.

Theorem 4.6. Let $T'$ be the semi-random-model tensor PCA input, with $\tau \ge n^{3/4} \log(n)^{1/4} / \varepsilon$. With
high probability over randomness in $T'$, Algorithm 4.1 outputs a vector $v$ with $\langle v, v_0\rangle \ge 1 - O(\varepsilon)$.

Proof. By Lemma 4.4, it will suffice to show that $B := T' - \tau \cdot v_0^{\otimes 3}$ has $\tilde{\mathbb{E}} B(x) \le \varepsilon' \tau \cdot (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$
for any degree-4 pseudo-distribution $\{x\}$, for some $\varepsilon' = \Theta(\varepsilon)$. We rewrite $B$ as

\[ B = (A + \tau \cdot (v_0 \otimes v_0) v_0^\top)(Q - \mathrm{Id}) + A , \]

where $A$ has independent entries from $N(0, 1)$. Let $\{x\}$ be a degree-4 pseudo-distribution.
Let $f(x) = \langle x^{\otimes 2}, (A + \tau \cdot (v_0 \otimes v_0) v_0^\top)(Q - \mathrm{Id}) x\rangle$. By Corollary 3.4,
$\tilde{\mathbb{E}} B(x) \le \tilde{\mathbb{E}} f(x) + O(n^{3/4} \log(n)^{1/4}) (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$ with high probability. By triangle inequality
and sub-multiplicativity of the operator norm, we get that with high probability

\[ \|(A + \tau \cdot (v_0 \otimes v_0) v_0^\top)(Q - \mathrm{Id})\| \le (\|A\| + \tau) \, \|Q - \mathrm{Id}\| \le O(n^{3/4}) , \]

where we have also used Lemma B.4 to bound $\|A\| \le O(n)$ with high probability and our
assumptions on $\tau$ and $\|Q - \mathrm{Id}\|$. By an argument similar to that in the proof of Theorem 3.2
(which may be found in Lemma A.6), this yields $\tilde{\mathbb{E}} f(x) \le O(n^{3/4}) (\tilde{\mathbb{E}} \|x\|^4)^{3/4}$ as desired. □

5 Linear Time Recovery via Further Relaxation

We now attack the problem of speeding up the algorithm from the preceding section. 
We would like to avoid solving a large semidefinite program to optimality: our goal is 
to instead use much faster linear-algebraic computations—in particular, we will recover 
the tensor PCA signal vector by performing a single singular vector computation on a 
relatively small matrix. This will complete the proofs of Theorem 1.7 and Theorem 1.6, 
yielding the desired running time. 

Our SoS algorithm in the preceding section turned on the existence of the $\lambda$-boundedness
certificate $\sum_i A_i \otimes A_i$, where $A_i$ are the slices of a random tensor $A$. Let $T = \tau \cdot v_0^{\otimes 3} + A$ be the
spiked-tensor input to tensor PCA. We could look at the matrix $\sum_i T_i \otimes T_i$ as a candidate
$\lambda$-boundedness certificate for $T(x)$. The spectrum of this matrix must not admit the spectral
bound that $\sum_i A_i \otimes A_i$ does, because $T(x)$ is not globally bounded: it has a large global
maximum near the signal $v_0$. This maximum plants a single large singular value in the
spectrum of $\sum_i T_i \otimes T_i$. The associated singular vector is readily decoded to recover the
signal.

Before stating and analyzing this fast linear-algebraic algorithm, we situate it more 
firmly in the SoS framework. In the following, we discuss spectral SoS, a convex relaxation 
of Problem 1.2 obtained by weakening the full-power SoS relaxation. We show that the 
spectrum of the aforementioned $\sum_i T_i \otimes T_i$ can be viewed as approximately solving the
spectral SoS relaxation. This gives the fast, certifying algorithm of Theorem 1.7. We also 
interpret the tensor unfolding algorithm given by Montanari and Richard for TPCA in 
the spiked tensor model as giving a more subtle approximate solution to the spectral SoS 
relaxation. We prove a conjecture by those authors that the algorithm successfully recovers 
the TPCA signal at the same signal-to-noise ratio as our other algorithms, up to a small 
pre-processing step in the algorithm; this proves Theorem 1.6 [MR14]. This last algorithm,
however, succeeds for somewhat different reasons than the others, and we will show that 
it consequently fails to certify its own success and that it is not robust to a certain kind of 
semi-adversarial choice of noise. 

5.1 The Spectral SoS Relaxation 

5.1.1 The SoS Algorithm: Matrix View 

To obtain spectral SoS, the convex relaxation of Problem 1.2 which we will be able to 
(approximately) solve quickly in the random case, we first need to return to the full-strength 
SoS relaxation and examine it from a more linear-algebraic standpoint. 

We have seen in Section 2.2 that a homogeneous $p \in \mathbb{R}[x]_{2d}$ may be represented as
an $n^d \times n^d$ matrix whose entries correspond to coefficients of $p$. A similar fact is true for
non-homogeneous $p$. Let $\#\mathrm{tuples}(d) = 1 + n + n^2 + \cdots + n^{d/2}$. Let $x^{\otimes \le d/2} := (x^{\otimes 0}, x, x^{\otimes 2}, \ldots, x^{\otimes d/2})$.
Then $p \in \mathbb{R}[x]_{\le d}$ can be represented as a $\#\mathrm{tuples}(d) \times \#\mathrm{tuples}(d)$ matrix; we say a matrix
$M$ of these dimensions is a matrix representation of $p$ if $\langle x^{\otimes \le d/2}, M x^{\otimes \le d/2}\rangle = p(x)$. For this
section, we let $\mathcal{M}_p$ denote the set of all such matrix representations of $p$.

A degree-$d$ pseudo-distribution $\{x\}$ can similarly be represented as an $\mathbb{R}^{\#\mathrm{tuples}(d) \times \#\mathrm{tuples}(d)}$
matrix. We say that $M$ is a matrix representation for $\{x\}$ if $M[\alpha, \beta] = \tilde{\mathbb{E}}\, x^{\alpha} x^{\beta}$ whenever $\alpha$ and
$\beta$ are multi-indices with $|\alpha|, |\beta| \le d$.

Formulated this way, if $M_{\{x\}}$ is the matrix representation of $\{x\}$ and $M_p \in \mathcal{M}_p$ for some
$p \in \mathbb{R}[x]_{\le 2d}$, then $\tilde{\mathbb{E}}\, p(x) = \langle M_{\{x\}}, M_p\rangle$. In this sense, pseudo-distributions and polynomials,
each represented as matrices, are dual under the trace inner product on matrices.

We are interested in optimization of polynomials over the sphere, and we have been
looking at pseudo-distributions $\{x\}$ satisfying $\{\|x\|^2 - 1 = 0\}$. From this matrix point of view,
the polynomial $\|x\|^2 - 1$ corresponds to a vector $w \in \mathbb{R}^{\#\mathrm{tuples}(d)}$ (in particular, the vector $w$ so
that $w w^\top$ is a matrix representation of $(\|x\|^2 - 1)^2$), and a degree-4 pseudo-distribution $\{x\}$
satisfies $\{\|x\|^2 - 1 = 0\}$ if and only if $w \in \ker M_{\{x\}}$.

A polynomial may have many matrix representations, but a pseudo-distribution has 
just one: a matrix representation of a pseudo-distribution must obey strong symmetry 
conditions in order to assign the same pseudo-expectation to every representation of the 
same polynomial. We will have much more to say about constructing matrices satisfying 
these symmetry conditions when we state and prove our lower bounds, but here we will 
in fact profit from relaxing these symmetry constraints. 

Let $p \in \mathbb{R}[x]_{\le 2d}$. In the matrix view, the SoS relaxation of the problem $\max_{\|x\|^2 = 1} p(x)$ is
the following convex program.

\[ \max_{\substack{M :\, w \in \ker M \\ M \succeq 0 \\ \langle M, M_1\rangle = 1}} \ \min_{M_p \in \mathcal{M}_p} \ \langle M, M_p\rangle . \tag{5.1} \]

It may not be immediately obvious why this program optimizes only over $M$ which are
matrix representations of pseudo-distributions. If, however, some $M$ does not obey the
requisite symmetries, then $\min_{M_p \in \mathcal{M}_p} \langle M, M_p\rangle = -\infty$, since the asymmetry may be exploited
by careful choice of $M_p \in \mathcal{M}_p$. Thus, at optimality this program yields $M$ which is the
matrix representation of a pseudo-distribution $\{x\}$ satisfying $\{\|x\|^2 - 1 = 0\}$.


5.1.2 Relaxing to the Degree-4 Dual 


We now formulate spectral SoS. In our analysis of full-power SoS for tensor PCA we have
primarily considered pseudo-expectations of homogeneous degree-4 polynomials; our
first step in further relaxing SoS is to project from $\mathbb{R}[x]_{\le 4}$ to $\mathbb{R}[x]_4$. Thus, now our matrices
$M, M'$ will be in $\mathbb{R}^{n^2 \times n^2}$ rather than $\mathbb{R}^{\#\mathrm{tuples}(2) \times \#\mathrm{tuples}(2)}$. The projection of the constraint on
the kernel in the non-homogeneous case implies $\mathrm{Tr}\, M = 1$ in the homogeneous case. The
projected program is

\[ \max_{\substack{\mathrm{Tr}\, M = 1 \\ M \succeq 0}} \ \min_{M_p \in \mathcal{M}_p} \ \langle M, M_p\rangle . \]

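We modify this a bit to make explicit that the relaxation is allowed to add and subtract
arbitrary matrix representations of the zero polynomial; in particular $M_{\|x\|^4} - \mathrm{Id}$ for any
$M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4}$. This program is the same as the one which precedes it.

\[ \max_{\substack{\mathrm{Tr}\, M = 1 \\ M \succeq 0}} \ \min_{\substack{M_p \in \mathcal{M}_p \\ M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4} \\ c \in \mathbb{R}}} \ \langle M, M_p - c \cdot M_{\|x\|^4}\rangle + c . \tag{5.2} \]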


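By weak duality, we can interchange the min and the max in (5.2) to obtain the dual
program:

\[
\begin{aligned}
\max_{\substack{\mathrm{Tr}\, M = 1 \\ M \succeq 0}} \ \min_{\substack{M_p \in \mathcal{M}_p \\ M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4} \\ c \in \mathbb{R}}} \ \langle M, M_p - c \cdot M_{\|x\|^4}\rangle + c
&\le \min_{\substack{M_p \in \mathcal{M}_p \\ M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4} \\ c \in \mathbb{R}}} \ \max_{\substack{\mathrm{Tr}\, M = 1 \\ M \succeq 0}} \ \langle M, M_p - c \cdot M_{\|x\|^4}\rangle + c && (5.3) \\
&= \min_{\substack{M_p \in \mathcal{M}_p \\ M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4} \\ c \in \mathbb{R}}} \ \max_{\|v\| = 1} \ \langle v v^\top, M_p - c \cdot M_{\|x\|^4}\rangle + c . && (5.4)
\end{aligned}
\]

We call this dual program the spectral SoS relaxation of $\max_{\|x\| = 1} p(x)$. If $p = \sum_i \langle x, A_i x\rangle^2$ for $A$
with independent entries from $N(0, 1)$, the spectral SoS relaxation achieves the same bound
as our analysis of the full-strength SoS relaxation: for such $p$, the spectral SoS relaxation
is at most $O(n^{3/2} \log(n)^{1/2})$ with high probability. The reason is exactly the same as in our
analysis of the full-strength SoS relaxation: the matrix $\sum_i A_i \otimes A_i$, whose spectrum we used
before to bound the full-strength SoS relaxation, is still a feasible dual solution.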
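To make the dual bound concrete, the following NumPy sketch evaluates the dual objective at one particular feasible point and compares it with the degree-4 polynomial $\sum_i \langle x, A_i x\rangle^2$ on random unit vectors. The choice $c = n$ with $M_{\|x\|^4} = \mathrm{vec}(\mathrm{Id})\,\mathrm{vec}(\mathrm{Id})^\top$ is taken from the discussion of $\mathbb{E} \sum_i A_i \otimes A_i$; the dimension, seed, and variable names are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n, n))

# M_p: the representation sum_i A_i (x) A_i of p(x) = sum_i <x, A_i x>^2
K = sum(np.kron(A[i], A[i]) for i in range(n))

vec_id = np.eye(n).reshape(-1)
M_norm4 = np.outer(vec_id, vec_id)      # one matrix representation of ||x||^4
c = float(n)

# Dual objective ||M_p - c * M_{||x||^4}|| + c (operator norm).
dual_value = np.linalg.norm(K - c * M_norm4, 2) + c

# Compare with p(x) and with f(x) = sum_i x_i <x, A_i x> on unit vectors.
xs = rng.standard_normal((300, n))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
quad_values = [np.kron(x, x) @ K @ np.kron(x, x) for x in xs]
f_values = [np.einsum("ijk,i,j,k->", A, x, x, x) for x in xs]
print(max(quad_values), dual_value, max(f_values), np.sqrt(dual_value))
```

On the sphere, every value in `quad_values` is at most `dual_value`, and by the Cauchy-Schwarz step of Theorem 3.2 every value of $f$ is at most its square root.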


5.2 Recovery via the $\sum_i T_i \otimes T_i$ Spectral SoS Solution

Let $T = \tau \cdot v_0^{\otimes 3} + A$ be the spiked-tensor input to tensor PCA. We know from our initial
characterization of SoS proofs of boundedness for degree-3 polynomials that the polynomial
$T'(x) := (x \otimes x)^\top (\sum_i T_i \otimes T_i)(x \otimes x)$ gives SoS-certifiable upper bounds on $T(x)$ on the unit
sphere. We consider the spectral SoS relaxation of $\max_{\|x\| = 1} T'(x)$,

\[ \min_{\substack{M_{T'(x)} \in \mathcal{M}_{T'(x)} \\ M_{\|x\|^4} \in \mathcal{M}_{\|x\|^4} \\ c \in \mathbb{R}}} \ \|M_{T'(x)} - c \cdot M_{\|x\|^4}\| + c . \]

Our goal now is to guess a good $M' \in \mathcal{M}_{T'(x)}$. We will take as our dual-feasible solution
the top singular vector of $\sum_i T_i \otimes T_i - \mathbb{E} \sum_i A_i \otimes A_i$. This is dual feasible with $c = n$, since
routine calculation gives $\langle x^{\otimes 2}, (\mathbb{E} \sum_i A_i \otimes A_i) x^{\otimes 2}\rangle = n \|x\|^4$. This top singular vector, which
differentiates the spectrum of $\sum_i T_i \otimes T_i$ from that of $\sum_i A_i \otimes A_i$, is exactly the manifestation
of the signal $v_0$ which differentiates $T(x)$ from $A(x)$. The following algorithm and analysis
captures this.

Recovery and Certification with $\sum_i T_i \otimes T_i$

Input: $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$ is a 3-tensor.

Goal: Find $v \in \mathbb{R}^n$ with $|\langle v, v_0\rangle| \ge 1 - o(1)$.

Algorithm 5.1 (Recovery). Compute the top (left or right) singular vector $v'$ of $M :=
\sum_i T_i \otimes T_i - \mathbb{E} \sum_i A_i \otimes A_i$. Reshape $v'$ into an $n \times n$ matrix $V'$. Compute the top singular
vector $v$ of $V'$. Output $v / \|v\|$.

Algorithm 5.2 (Certification). Run Algorithm 5.1 to obtain $v$. Let $S := T - v^{\otimes 3}$. Compute
the top singular value $\lambda$ of

\[ \sum_i S_i \otimes S_i - \mathbb{E} \sum_i A_i \otimes A_i . \]

If $\lambda \le O(n^{3/2} \log(n)^{1/2})$, output certify. Otherwise, output fail.
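The recovery step can be sketched directly with dense linear algebra. The NumPy code below is a toy illustration rather than the fast implementation analyzed later: it forms the $n^2 \times n^2$ matrix explicitly, uses the closed form $\mathbb{E} \sum_i A_i \otimes A_i = n \cdot \mathrm{vec}(\mathrm{Id})\,\mathrm{vec}(\mathrm{Id})^\top$ for standard Gaussian noise, and picks a deliberately large $\tau$ so that a small instance is easy. The function name `recover_spectral`, the instance size, and the seed are assumptions of this example.

```python
import numpy as np

def recover_spectral(T):
    """Dense sketch of Algorithm 5.1 for an n x n x n tensor T with slices T[i]."""
    n = T.shape[0]
    vec_id = np.eye(n).reshape(-1)
    # M = sum_i T_i (x) T_i - E sum_i A_i (x) A_i, assuming N(0, 1) noise entries
    M = sum(np.kron(T[i], T[i]) for i in range(n)) - n * np.outer(vec_id, vec_id)
    U, _, _ = np.linalg.svd(M)
    V_prime = U[:, 0].reshape(n, n)       # fold the top singular vector into n x n
    _, _, Vt = np.linalg.svd(V_prime)
    v = Vt[0]                             # top right singular vector of V'
    return v / np.linalg.norm(v)

rng = np.random.default_rng(3)
n, tau = 8, 200.0                         # tau is far above n^{3/4}: an easy toy instance
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
A = rng.standard_normal((n, n, n))
T = tau * np.einsum("i,j,k->ijk", v0, v0, v0) + A

v = recover_spectral(T)
corr_sq = float(v @ v0) ** 2              # squared correlation; sign of v is arbitrary
print("squared correlation with v0:", corr_sq)
```

Because singular vectors are determined only up to sign, the sketch reports the squared correlation, matching the $\langle v, v_0\rangle^2$ guarantee stated for Algorithm 5.1.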

The following theorem describes the behavior of Algorithm 5.1 and Algorithm 5.2 and 
gives a proof of Theorem 1.7 and Corollary 1.7. 

Theorem 5.3 (Formal version of Theorem 1.7). Let $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$ has
independent entries from $N(0, 1)$. In other words, we are given an instance of Problem 1.1. Let
$\tau \ge n^{3/4} \log(n)^{1/4} / \varepsilon$. Then:

- With high probability, Algorithm 5.1 returns $v$ with $\langle v, v_0\rangle^2 \ge 1 - O(\varepsilon)$.

- If Algorithm 5.2 outputs certify then $T(x) \le \tau \cdot \langle v, x\rangle^3 + O(n^{3/4} \log(n)^{1/4})$ (regardless of
the distribution of $A$). If $A$ is distributed as above, then Algorithm 5.2 outputs certify with
high probability.

- Both Algorithm 5.1 and Algorithm 5.2 can be implemented in time $O(n^4 \log(1/\varepsilon))$.

The argument that Algorithm 5.1 recovers a good vector in the spiked tensor model
comes in three parts: we show that under appropriate regularity conditions on the noise $A$
the matrix $\sum_i T_i \otimes T_i - \mathbb{E} \sum_i A_i \otimes A_i$ has a good singular vector, then that with high probability in the
spiked tensor model those regularity conditions hold, and finally that the good singular
vector can be used to recover the signal.

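Lemma 5.4. Let $T = \tau \cdot v_0^{\otimes 3} + A$ be an input tensor. Suppose $\|\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i\| \le \varepsilon \tau^2$
and that $\|\sum_i v_0(i) A_i\| \le \varepsilon \tau$. Then the top (left or right) singular vector $v'$ of $M$ has $\langle v', v_0 \otimes v_0\rangle^2 \ge
1 - O(\varepsilon)$.

Lemma 5.5. Let $T = \tau \cdot v_0^{\otimes 3} + A$. Suppose $A$ has independent entries from $N(0, 1)$. Then with high
probability we have $\|\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i\| \le O(n^{3/2} \log(n)^{1/2})$ and $\|\sum_i v_0(i) A_i\| \le O(\sqrt{n})$.

Lemma 5.6. Let $v_0 \in \mathbb{R}^n$ and $v' \in \mathbb{R}^{n^2}$ be unit vectors so that $\langle v', v_0 \otimes v_0\rangle \ge 1 - O(\varepsilon)$. Then the
top right singular vector $v$ of the $n \times n$ matrix folding $V'$ of $v'$ satisfies $\langle v, v_0\rangle \ge 1 - O(\varepsilon)$.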

A similar fact to Lemma 5.6 appears in [MR14]. 

The proofs of Lemma 5.4 and Lemma 5.6 follow here. The proof of Lemma 5.5 uses 
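only standard concentration of measure arguments; we defer it to Section B.

Proof of Lemma 5.4. We expand $M$ as follows.

\[
\begin{aligned}
M &= \sum_i \Big[ \tau^2 \cdot (v_0^{\otimes 3})_i \otimes (v_0^{\otimes 3})_i + \tau \cdot \big( (v_0^{\otimes 3})_i \otimes A_i + A_i \otimes (v_0^{\otimes 3})_i \big) + A_i \otimes A_i \Big] - \mathbb{E} \sum_i A_i \otimes A_i \\
&= \tau^2 \cdot (v_0 \otimes v_0)(v_0 \otimes v_0)^\top + \tau \cdot v_0 v_0^\top \otimes \sum_i v_0(i) A_i + \tau \cdot \sum_i v_0(i) A_i \otimes v_0 v_0^\top + \sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i .
\end{aligned}
\]

By assumption, the noise term is bounded in operator norm: we have $\|\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes
A_i\| \le \varepsilon \tau^2$. Similarly, by assumption the cross-term has $\|\tau \cdot v_0 v_0^\top \otimes \sum_i v_0(i) A_i\| \le \varepsilon \tau^2$.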

T • Y P ^H 3 )i ® A ‘ + ® (O.-^ = T ' X v o(i) p iAv 0 vl 0 A, + A; 0 v 0 vl)P u ± . 

i i 

All in all, by triangle inequality,

\[ \Big\| \tau \cdot v_0 v_0^\top \otimes \sum_i v_0(i) A_i + \tau \cdot \sum_i v_0(i) A_i \otimes v_0 v_0^\top + \sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i \Big\| \le O(\varepsilon \tau^2) . \]

Again by triangle inequality,

\[ \|M\| \ge (v_0 \otimes v_0)^\top M (v_0 \otimes v_0) = \tau^2 - O(\varepsilon \tau^2) . \]

Let $u, w$ be the top left and right singular vectors of $M$. We have

\[ u^\top M w = \tau^2 \cdot \langle u, v_0 \otimes v_0\rangle \langle w, v_0 \otimes v_0\rangle + O(\varepsilon \tau^2) \ge \tau^2 - O(\varepsilon \tau^2) , \]

so rearranging gives the result. □

Proof of Lemma 5.6. Let $v_0, v', V', v$ be as in the lemma statement. We know $v$ is the
maximizer of $\max_{\|w\|, \|w'\| = 1} w^\top V' w'$. By assumption,

\[ v_0^\top V' v_0 = \langle v', v_0 \otimes v_0\rangle \ge 1 - O(\varepsilon) . \]

Thus, the top singular value of $V'$ is at least $1 - O(\varepsilon)$, and since $v'$ is a unit vector, the
Frobenius norm of $V'$ is 1 and so all the rest of the singular values are $O(\varepsilon)$. Expressing $v_0$
in the right singular basis of $V'$ and examining the norm of $V' v_0$ completes the proof. □

Proof of Theorem 5.3. The first claim, that Algorithm 5.1 returns a good vector, follows from
the previous three lemmas: Lemma 5.4, Lemma 5.5, and Lemma 5.6. The next, for Algorithm 5.2,
follows from noting that $\sum_i S_i \otimes S_i - \mathbb{E} \sum_i A_i \otimes A_i$ is a feasible solution to the spectral SoS dual.
For the claimed runtime, since we are working with matrices of size $n^4$, it will be enough to
show that the top singular vector of $M$ and the top singular value of $\sum_i S_i \otimes S_i - \mathbb{E} \sum_i A_i \otimes A_i$
can be recovered with $O(\mathrm{poly} \log(n))$ matrix-vector multiplies.

In the first case, we start by observing that it is enough to find a vector $w$ which has
$\langle w, v'\rangle \ge 1 - \varepsilon$, where $v'$ is a top singular vector of $M$. Let $\lambda_1, \lambda_2$ be the top two singular
values of $M$. The analysis of the algorithm already showed that $\lambda_1 / \lambda_2 \ge \Omega(1/\varepsilon)$. Standard
analysis of the matrix power method now yields that $O(\log(1/\varepsilon))$ iterations will suffice.

We finally turn to the top singular value of $\sum_i S_i \otimes S_i - \mathbb{E} \sum_i A_i \otimes A_i$. Here the matrix
may not have a spectral gap, but all we need to do is ensure that the top singular value is
no more than $O(n^{3/2} \log(n)^{1/2})$. We may assume that some singular value is greater than
$O(n^{3/2} \log(n)^{1/2})$. If all of them are, then a single matrix-vector multiply initialized with a
random vector will discover this. Otherwise, there is a constant spectral gap, so a standard
analysis of matrix power method says that within $O(\log n)$ iterations a singular value
greater than $O(n^{3/2} \log(n)^{1/2})$ will be found, if it exists. □
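The runtime argument relies on multiplying by $M$ without ever forming the $n^2 \times n^2$ matrix. One way to do this, sketched below in NumPy, uses the identity $(B \otimes C)\,\mathrm{vec}(Y) = \mathrm{vec}(B Y C^\top)$ for row-major flattening, together with the closed form of $\mathbb{E} \sum_i A_i \otimes A_i$ for standard Gaussian noise. The helper names `apply_M` and `top_left_singular_vector`, the iteration count, and the toy instance are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def apply_M(T, y, transpose=False):
    """Multiply y by M = sum_i T_i (x) T_i - n vec(Id) vec(Id)^T (or by M^T)."""
    n = T.shape[0]
    Y = y.reshape(n, n)
    if transpose:
        out = sum(T[i].T @ Y @ T[i] for i in range(n))   # (T_i (x) T_i)^T vec(Y)
    else:
        out = sum(T[i] @ Y @ T[i].T for i in range(n))   # (T_i (x) T_i) vec(Y)
    out = out - n * np.trace(Y) * np.eye(n)              # subtract the expectation term
    return out.reshape(-1)                               # O(n^4) work, O(n^2) extra memory

def top_left_singular_vector(T, iters=50, seed=0):
    """Power method on M M^T using only implicit matrix-vector products."""
    n = T.shape[0]
    u = np.random.default_rng(seed).standard_normal(n * n)
    for _ in range(iters):
        u = apply_M(T, apply_M(T, u, transpose=True))
        u /= np.linalg.norm(u)
    return u

rng = np.random.default_rng(4)
n, tau = 6, 200.0
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = tau * np.einsum("i,j,k->ijk", v0, v0, v0) + rng.standard_normal((n, n, n))
u_top = top_left_singular_vector(T)
print("correlation with v0 (x) v0:", abs(u_top @ np.kron(v0, v0)))
```

Each product costs $n$ multiplications of $n \times n$ matrices, which is where the $n^4$ factor in the running time comes from.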

5.3 Nearly-Linear-Time Recovery via Tensor Unfolding and Spectral 
SoS 

On input $T = \tau \cdot v_0^{\otimes 3} + A$, where as usual $v_0 \in \mathbb{R}^n$ and $A$ has independent entries from
$N(0, 1)$, Montanari and Richard's Tensor Unfolding algorithm computes the top singular
vector $u$ of the squarest-possible flattening of $T$ into a matrix. It then extracts $v$ with
$\langle v, v_0\rangle^2 \ge 1 - o(1)$ from $u$ with a second singular vector computation.


Recovery with $TT^\top$, a.k.a. Tensor Unfolding

Input: $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$ is a 3-tensor.

Goal: Find $v \in \mathbb{R}^n$ with $|\langle v, v_0\rangle| \ge 1 - o(1)$.

Algorithm 5.7 (Recovery). Compute the top eigenvector $v$ of $M := T^\top T$. Output $v$.

We show that this algorithm successfully recovers a vector $v$ with $\langle v, v_0\rangle^2 \ge 1 - O(\varepsilon)$
when $\tau \ge n^{3/4} / \varepsilon$. Montanari and Richard conjectured this but were only able to show it
when $\tau \ge n$. We also show how to implement the algorithm in time $\tilde{O}(n^3)$, that is to say, in
time nearly-linear in the input size.

Despite its a priori simplicity, the analysis of Algorithm 5.7 is more subtle than for any
of our other algorithms. This would not be true for even-order tensors, for which the
square matrix unfolding has one singular value asymptotically larger than all the
rest, and indeed the corresponding singular vector is well-correlated with $v_0$. However,
in the case of odd-order tensors the unfolding has no spectral gap. Instead, the signal $v_0$
has some second-order effect on the spectrum of the matrix unfolding, which is enough to
recover it.

We first situate this algorithm in the SoS framework. In the previous section we
examined the feasible solution $\sum_i T_i \otimes T_i - \mathbb{E} \sum_i A_i \otimes A_i$ to the spectral SoS relaxation of
$\max_{\|x\| = 1} T(x)$. The tensor unfolding algorithm works by examining the top singular vector
of the flattening $T$ of $T$, which is the top eigenvector of the $n \times n$ matrix $M = T^\top T$, which
in turn has the same spectrum as the $n^2 \times n^2$ matrix $TT^\top$. The latter is also a feasible dual
solution to the spectral SoS relaxation of $\max_{\|x\| = 1} T(x)$. However, the bound it provides
on $\max_{\|x\| = 1} T(x)$ is much worse than that given by $\sum_i T_i \otimes T_i$. The latter, as we saw in the
preceding section, gives the bound $O(n^{3/4} \log(n)^{1/4})$. The former, by contrast, gives only
$O(n)$, which is the operator norm of a random $n^2 \times n$ matrix (see Lemma B.4). This $n$ versus
$n^{3/4}$ is the same as the gap between Montanari and Richard's conjectured bound and what
they were able to prove.

Theorem 5.8. For an instance of Problem 1.1 with $\tau \ge n^{3/4} / \varepsilon$, with high probability Algorithm 5.7
recovers a vector $v$ with $\langle v, v_0\rangle^2 \ge 1 - O(\varepsilon)$. Furthermore, Algorithm 5.7 can be implemented in
time $\tilde{O}(n^3)$.

Lemma 5.9. Let $T = \tau \cdot v_0^{\otimes 3} + A$ where $v_0 \in \mathbb{R}^n$ is a unit vector, so an instance of Problem 1.1.
Suppose $A$ satisfies $A^\top A = C \cdot \mathrm{Id}_{n \times n} + E$ for some $C \ge 0$ and $E$ with $\|E\| \le \varepsilon \tau^2$ and that
$\|A^\top (v_0 \otimes v_0)\| \le \varepsilon \tau$. Let $u$ be the top left singular vector of the matrix $T$. Then $\langle v_0, u\rangle^2 \ge 1 - O(\varepsilon)$.

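Proof. The vector $u$ is the top eigenvector of the $n \times n$ matrix $TT^\top$, which is also the top
eigenvector of $M := TT^\top - C \cdot \mathrm{Id}$. We expand:

\[
\begin{aligned}
u^\top M u &= u^\top \big[ \tau^2 \cdot v_0 v_0^\top + \tau \cdot v_0 (v_0 \otimes v_0)^\top A + \tau \cdot A^\top (v_0 \otimes v_0) v_0^\top + E \big] u \\
&= \tau^2 \cdot \langle u, v_0\rangle^2 + u^\top \big[ \tau \cdot v_0 (v_0 \otimes v_0)^\top A + \tau \cdot A^\top (v_0 \otimes v_0) v_0^\top + E \big] u \\
&\le \tau^2 \langle u, v_0\rangle^2 + O(\varepsilon \tau^2) .
\end{aligned}
\]

Again by triangle inequality, $u^\top M u \ge v_0^\top M v_0 = \tau^2 - O(\varepsilon \tau^2)$. So rearranging we get
$\langle u, v_0\rangle^2 \ge 1 - O(\varepsilon)$ as desired. □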

The following lemma is a consequence of standard matrix concentration inequalities; 
we defer its proof to Section B, Lemma B.10. 

Lemma 5.10. Let $A$ have independent entries from $N(0, 1)$. Let $v_0 \in \mathbb{R}^n$ be a unit vector. With
high probability, the matrix $A$ satisfies $A^\top A = n^2 \cdot \mathrm{Id} + E$ for some $E$ with $\|E\| \le O(n^{3/2})$ and
$\|A^\top (v_0 \otimes v_0)\| \le O(\sqrt{n \log n})$.

The final component of a proof of Theorem 5.8 is to show how it can be implemented in
time $\tilde{O}(n^3)$. Since $M$ factors as $T^\top T$, a matrix-vector multiply by $M$ can be implemented
in time $O(n^3)$. Unfortunately, $M$ does not have an adequate eigenvalue gap to make
matrix power method efficient. As we know from Lemma 5.10, suppressing $\varepsilon$s and
constants, $M$ has eigenvalues in the range $n^2 \pm n^{3/2}$. Thus, the eigenvalue gap of $M$ is at
most $g = O(1 + 1/\sqrt{n})$. For any number $k$ of matrix-vector multiplies with $k \le n^{1/2 - \delta}$, the
eigenvalue gap will become at most $(1 + 1/\sqrt{n})^{n^{1/2 - \delta}}$, which is subconstant. To get around
this problem, we employ a standard trick to improve spectral gaps of matrices close to
$C \cdot \mathrm{Id}$: remove $C \cdot \mathrm{Id}$.

Lemma 5.11. Under the assumptions of Theorem 5.8, Algorithm 5.7 can be implemented in time
$\tilde{O}(n^3)$ (which is linear in the input size, $n^3$).

Proof. Note that the top eigenvector of $M$ is the same as that of $M - n^2 \cdot \mathrm{Id}$. The latter matrix,
by the same analysis as in Lemma 5.9, is given by

\[ M - n^2 \cdot \mathrm{Id} = \tau^2 \cdot v_0 v_0^\top + M' , \]

where $\|M'\| = O(\varepsilon \tau^2)$. Note also that a matrix-vector multiply by $M - n^2 \cdot \mathrm{Id}$ can still be
done in time $O(n^3)$. Thus, $M - n^2 \cdot \mathrm{Id}$ has eigenvalue gap $\Omega(1/\varepsilon)$, which is enough so that
the whole algorithm runs in time $\tilde{O}(n^3)$. □
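The shifted power iteration described in this proof can be sketched in a few lines of NumPy. The code below is an illustration under assumptions of our own: the flattening is taken to be the row-major $n^2 \times n$ reshape, the shift $n^2$ assumes standard Gaussian noise, and the instance uses a deliberately large $\tau$ so that the test is easy. The function name `unfold_recover`, the iteration count, and the seeds are hypothetical.

```python
import numpy as np

def unfold_recover(T, iters=100, seed=0):
    """Power iteration on T^T T - n^2 Id, where T is flattened to n^2 x n."""
    n = T.shape[0]
    F = T.reshape(n * n, n)                  # rows indexed by (i, j), columns by k
    u = np.random.default_rng(seed).standard_normal(n)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        # One step costs two O(n^3) matrix-vector products; the shift removes
        # the n^2 * Id bulk of A^T A so that the signal direction dominates.
        u = F.T @ (F @ u) - (n ** 2) * u
        u /= np.linalg.norm(u)
    return u

rng = np.random.default_rng(5)
n, tau = 20, 300.0
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = tau * np.einsum("i,j,k->ijk", v0, v0, v0) + rng.standard_normal((n, n, n))

u = unfold_recover(T)
corr_sq = float(u @ v0) ** 2
print("squared correlation with v0:", corr_sq)
```

Without the shift, the same loop would still converge in principle, but only after many more iterations, because all eigenvalues of $T^\top T$ sit close to $n^2$.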

Proof of Theorem 5.8. Immediate from Lemma 5.9, Lemma 5.10, and Lemma 5.11. □

5.4 Fast Recovery in the Semi-Random Model

There is a qualitative difference between the aggregate matrix statistics needed by our
certifying algorithms (Algorithm 4.1, Algorithm 4.2, Algorithm 5.1, Algorithm 5.2) and
those needed by rounding the tensor unfolding solution to spectral SoS (Algorithm 5.7). In a
precise sense, the needs of the latter are greater. The former algorithms rely only on
first-order behavior of the spectra of a tensor unfolding, while the latter relies on second-order
spectral behavior. Since it uses second-order properties of the randomness, Algorithm 5.7
fails in the semi-random model.

Theorem 5.12. Let $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ is a unit vector and $A$ has independent entries
from $N(0, 1)$. There is $\tau = \Omega(n^{7/8})$ so that with high probability there is an adversarial choice of
$Q$ with $\|Q - \mathrm{Id}\| \le O(n^{-1/4})$ so that the matrix $(TQ)^\top TQ = n^2 \cdot \mathrm{Id}$. In particular, for such $\tau$,
Algorithm 5.7 cannot recover the signal $v_0$.

Proof. Let $M$ be the $n \times n$ matrix $M := T^\top T$. Let $Q = n \cdot M^{-1/2}$. It is clear that $(TQ)^\top TQ = n^2 \, \mathrm{Id}$.
It suffices to show that $\|Q - \mathrm{Id}\| \le O(n^{-1/4})$ with high probability. We expand the matrix $M$ as

\[ M = \tau^2 \cdot v_0 v_0^\top + \tau \cdot v_0 (v_0 \otimes v_0)^\top A + \tau \cdot A^\top (v_0 \otimes v_0) v_0^\top + A^\top A . \]

By Lemma 5.10, $A^\top A = n^2 \cdot \mathrm{Id} + E$ for some $E$ with $\|E\| \le O(n^{3/2})$ and $\|A^\top (v_0 \otimes v_0)\| \le
O(\sqrt{n \log n})$, both with high probability. Thus, the eigenvalues of $M$ all lie in the range
$n^2 \pm O(n^{1 + 3/4})$. The eigenvalues of $Q$ in turn lie in the range

\[ \frac{n}{(n^2 \pm O(n^{1 + 3/4}))^{1/2}} = \frac{1}{(1 \pm O(n^{-1/4}))^{1/2}} = \frac{1}{1 \pm O(n^{-1/4})} . \]

Finally, the eigenvalues of $Q - \mathrm{Id}$ lie in the range $\frac{1}{1 \pm O(n^{-1/4})} - 1 = \pm O(n^{-1/4})$, so we are done. □
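The adversarial construction $Q = n \cdot M^{-1/2}$ is easy to reproduce numerically. The NumPy sketch below builds it through an eigendecomposition of $M = T^\top T$ and checks that the modified unfolding has exactly flat spectrum; the dimension, seed, and variable names are assumptions for this illustration, and at such a small $n$ the printed deviation $\|Q - \mathrm{Id}\|$ only hints at the asymptotic $O(n^{-1/4})$ rate.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
tau = n ** (7 / 8)                               # signal strength from Theorem 5.12
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = tau * np.einsum("i,j,k->ijk", v0, v0, v0) + rng.standard_normal((n, n, n))

F = T.reshape(n * n, n)                          # n^2 x n flattening of T
M = F.T @ F                                      # symmetric positive definite here
w, V = np.linalg.eigh(M)
Q = n * (V * w ** -0.5) @ V.T                    # Q = n * M^{-1/2}

G = (F @ Q).T @ (F @ Q)                          # should equal n^2 * Id
dev = np.linalg.norm(Q - np.eye(n), 2)           # operator norm of Q - Id
print("max |G - n^2 Id| entry:", np.abs(G - n ** 2 * np.eye(n)).max())
print("||Q - Id||:", dev, " n^(-1/4):", n ** -0.25)
```

After this reweighting every direction has the same eigenvalue $n^2$, so a top-eigenvector computation carries no information about $v_0$.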

The argument that Algorithm 5.1 and Algorithm 5.2 still succeed in the semi-random
model is routine; for completeness we discuss here the necessary changes to the
proof of Theorem 5.3. The non-probabilistic certification claims made in Theorem 5.3 are
independent of the input model, so we show that Algorithm 5.1 still finds the signal with
high probability and that Algorithm 5.2 still fails only with a small probability.

Theorem 5.13. In the semi-random model with $\varepsilon \ge n^{-1/4}$ and $\tau \ge n^{3/4} \log(n)^{1/4} / \varepsilon$, with high
probability, Algorithm 5.1 returns $v$ with $\langle v, v_0\rangle^2 \ge 1 - O(\varepsilon)$ and Algorithm 5.2 outputs certify.

Proof. We discuss the necessary modifications to the proof of Theorem 5.3. Since $\varepsilon \ge n^{-1/4}$,
we have that $\|(Q - \mathrm{Id}) v_0\| \le O(\varepsilon)$. It suffices then to show that the probabilistic bounds in
Lemma 5.5 hold with $A$ replaced by $AQ$. Note that this means each $A_i$ becomes $A_i Q$. By
assumption, $\|Q \otimes Q - \mathrm{Id} \otimes \mathrm{Id}\| \le O(\varepsilon)$, so the probabilistic bound on $\|\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i\|$
carries over to the semi-random setting. A similar argument holds for $\sum_i v_0(i) A_i Q$, which
is enough to complete the proof. □

5.5 Fast Recovery with Symmetric Noise

We suppose now that $A$ is a symmetric Gaussian noise tensor; that is, that $A$ is the average
of $A_0^{\pi}$ over all $\pi \in S_3$, for some order-3 tensor $A_0$ with iid standard Gaussian entries.

It was conjectured by Montanari and Richard [MR14] that the tensor unfolding technique
can recover the signal vector $v_0$ in the single-spike model $T = \tau v_0^{\otimes 3} + A$ with signal-to-noise
ratio $\tau \ge \tilde{\Omega}(n^{3/4})$ under both asymmetric and symmetric noise.

Our previous techniques fail in this symmetric noise scenario due to lack of independence
between the entries of the noise tensor. However, we sidestep that issue here by
restricting our attention to an asymmetric block of the input tensor.

The resulting algorithm is not precisely identical to the tensor unfolding algorithm 
investigated by Montanari and Richard, but is based on tensor unfolding with only 
superficial modifications. 


Fast Recovery under Symmetric Noise 

Input: $T = \tau \cdot v_0^{\otimes 3} + A$, where $v_0 \in \mathbb{R}^n$ and $A$ is a 3-tensor.

Goal: Find $v \in \mathbb{R}^n$ with $|\langle v, v_0\rangle| \ge 1 - o(1)$.

Algorithm 5.14 (Recovery). Take $X, Y, Z$ a random partition of $[n]$, and $R$ a random rotation
of $\mathbb{R}^n$. Let $P_X$, $P_Y$, and $P_Z$ be the diagonal projectors onto the coordinates indicated by $X$, $Y$,
and $Z$. Let $U := R^{\otimes 3} T$, so that we have the matrix unfolding $U := (R \otimes R) T R^\top$. Using the
matrix power method, compute the top singular vectors $v_X$, $v_Y$, and $v_Z$ respectively of the
matrices

\[
\begin{aligned}
M_X &:= P_X U^\top (P_Y \otimes P_Z) U P_X - n^2/9 \cdot \mathrm{Id} \\
M_Y &:= P_Y U^\top (P_Z \otimes P_X) U P_Y - n^2/9 \cdot \mathrm{Id} \\
M_Z &:= P_Z U^\top (P_X \otimes P_Y) U P_Z - n^2/9 \cdot \mathrm{Id} .
\end{aligned}
\]

Output the normalization of $R^{-1}(v_X + v_Y + v_Z)$.


Remark 5.15 (Implementation of Algorithm 5.14 in nearly-linear time). It is possible to
implement each iteration of the matrix power method in Algorithm 5.14 in linear time. We
focus on multiplying a vector by $M_X$ in linear time; the other cases follow similarly.

We can expand $M_X = P_X R T^T (R \otimes R)^T (P_Y \otimes P_Z)(R \otimes R) T R^T P_X - n^2/9 \cdot \mathrm{Id}$. It is simple enough
to multiply an $n$-dimensional vector by $P_X$, $R$, $R^T$, $T$, and $\mathrm{Id}$ in linear time. Furthermore,
multiplying an $n^2$-dimensional vector by $T^T$ is also a simple linear-time operation. The
trickier part lies in multiplying an $n^2$-dimensional vector, say $v$, by the $n^2$-by-$n^2$ matrix

$(R \otimes R)^T (P_Y \otimes P_Z)(R \otimes R)$.

To accomplish this, we simply reflatten our tensors. Let $V$ be the $n$-by-$n$ matrix
flattening of $v$. Then we compute the matrix $R^T P_Y R \cdot V \cdot R^T P_Z^T R$, and return its flattening
back into an $n^2$-dimensional vector, and this will be equal to $(R \otimes R)^T (P_Y \otimes P_Z)(R \otimes R)\, v$.
This equivalence follows by taking the singular value decomposition $V = \sum_i \lambda_i u_i w_i^T$, and
noting that $v = \sum_i \lambda_i u_i \otimes w_i$.
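The reflattening step in this remark can be checked numerically. The following is a minimal sketch, assuming NumPy, a row-major flattening of $V$, and 0/1 indicator vectors `p_y`, `p_z` for the diagonal projectors; the function name `apply_structured` and the small test sizes are illustrative choices, not part of the original text.

```python
import numpy as np

def apply_structured(R, p_y, p_z, v):
    """Multiply v (length n^2) by (R x R)^T (P_Y x P_Z) (R x R)
    without ever forming an n^2-by-n^2 matrix."""
    n = R.shape[0]
    V = v.reshape(n, n)                 # row-major flattening: v = sum_ij V[i,j] e_i (x) e_j
    left = R.T @ (p_y[:, None] * R)     # R^T P_Y R
    right = R.T @ (p_z[:, None] * R)    # R^T P_Z R
    return (left @ V @ right.T).reshape(n * n)

rng = np.random.default_rng(0)
n = 6
R, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a random orthogonal matrix
labels = rng.permutation(n) % 3                     # hypothetical partition labels
p_y = (labels == 1).astype(float)
p_z = (labels == 2).astype(float)
v = rng.standard_normal(n * n)

# Dense reference computation, only feasible for tiny n.
RR = np.kron(R, R)
dense = RR.T @ np.kron(np.diag(p_y), np.diag(p_z)) @ RR @ v
fast = apply_structured(R, p_y, p_z, v)
max_err = float(np.max(np.abs(dense - fast)))
print(max_err)
```

With a dense $R$ the structured product costs $O(n^3)$ arithmetic operations, which is linear in the $n^3$ entries of the input tensor, whereas the dense reference costs $O(n^4)$ even after the Kronecker products are built.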
Lemma 5.16. Given a unit vector $u \in \mathbb{R}^n$, a random rotation $R$ over $\mathbb{R}^n$, and a projection $P$ to an
$m$-dimensional subspace, with high probability

$\left| \|PRu\|^2 - m/n \right| \le O\left(\sqrt{m/n^2}\, \log m\right)$.

Proof. Let $\gamma$ be a random variable distributed as the norm of a vector in $\mathbb{R}^n$ with entries
independently drawn from $N(0, 1/n)$. Then because Gaussian vectors are rotationally
invariant and $Ru$ is a random unit vector, the coordinates of $\gamma Ru$ are independent and
Gaussian in any orthogonal basis.

So $\gamma^2 \|PRu\|^2$ is the sum of the squares of $m$ independent variables drawn from $N(0, 1/n)$.
By a Bernstein inequality, $|\gamma^2 \|PRu\|^2 - m/n| \le O(\sqrt{m/n^2}\, \log m)$ with high probability. Also
by a Bernstein inequality, $|\gamma^2 - 1| \le O(\sqrt{1/n}\, \log n)$ with high probability. □
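As an informal illustration of the concentration stated in Lemma 5.16, the sketch below samples random unit vectors and measures how far the squared norm of an $m$-coordinate projection strays from $m/n$. It assumes NumPy and uses the fact that $Ru$ is a uniformly random unit vector, so it is sampled directly by normalizing a Gaussian vector; the sizes `n`, `m`, `trials` are arbitrary, and the hidden constant in the $O(\cdot)$ is taken to be 1 purely for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, trials = 3000, 1000, 200

# Each row is a uniformly random unit vector, standing in for Ru.
g = rng.standard_normal((trials, n))
ru = g / np.linalg.norm(g, axis=1, keepdims=True)

# P projects onto the first m coordinates.
proj_sq = np.sum(ru[:, :m] ** 2, axis=1)
worst_dev = float(np.max(np.abs(proj_sq - m / n)))
scale = float(np.sqrt(m / n**2) * np.log(m))   # sqrt(m/n^2) * log m

print(worst_dev, scale)
```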

Theorem 5.17. For $\tau \ge n^{3/4}/\varepsilon$, with high probability, Algorithm 5.14 recovers a vector $v$ with
$\langle v, v_0 \rangle \ge 1 - O(\varepsilon)$ when $A$ is a symmetric Gaussian noise tensor (as in Problem 1.3) and
$\varepsilon \ge \log(n)/\sqrt{n}$.

Furthermore, the matrix power iteration steps in Algorithm 5.14 each converge within $O(-\log(\varepsilon))$
steps, so that the algorithm overall runs in almost linear time $O(n^3 \log(1/\varepsilon))$.

Proof. Name the projections $U_X := (P_Y \otimes P_Z) U P_X$, $U_Y := (P_Z \otimes P_X) U P_Y$, and $U_Z := (P_X \otimes P_Y) U P_Z$.

First off, $U = \tau (Rv_0)^{\otimes 3} + A'$ where $A'$ is a symmetric Gaussian tensor (distributed
identically to $A$). This follows by noting that multiplication by $R^{\otimes 3}$ commutes with
permutation of indices, so that $(R^{\otimes 3} B)^\pi = R^{\otimes 3} B^\pi$, where we let $B$ be the asymmetric
Gaussian tensor so that $A = \sum_{\pi \in S_3} B^\pi$. Then $A' = R^{\otimes 3} \sum_{\pi \in S_3} B^\pi = \sum_{\pi \in S_3} (R^{\otimes 3} B)^\pi$. This is
identically distributed with $A$, as follows from the rotational symmetry of $B$.


Thus $U_X = \tau (P_Y \otimes P_Z)(R \otimes R)(v_0 \otimes v_0)(P_X R v_0)^T + (P_Y \otimes P_Z) A' P_X$, and

$M_X + n^2/9 \cdot \mathrm{Id} = U_X^T U_X$

$\quad = \tau^2 \|P_Y R v_0\|^2 \|P_Z R v_0\|^2 (P_X R v_0)(P_X R v_0)^T$ (5.5)

$\quad + \tau (P_X R v_0)(v_0 \otimes v_0)^T (R \otimes R)^T (P_Y \otimes P_Z) A' P_X$ (5.6)

$\quad + \tau P_X A'^T (P_Y \otimes P_Z)(R \otimes R)(v_0 \otimes v_0)(P_X R v_0)^T$ (5.7)

$\quad + P_X A'^T (P_Y \otimes P_Z) A' P_X$. (5.8)


Let $S$ refer to Expression 5.5. By Lemma 5.16, $\left| \|P R v_0\|^2 - \tfrac{1}{3} \right| \le O(\sqrt{1/n}\, \log n)$ with
high probability for $P \in \{P_X, P_Y, P_Z\}$. Hence $S = (\tfrac{1}{9} \pm O(\sqrt{1/n}\, \log n))\, \tau^2 (P_X R v_0)(P_X R v_0)^T$ and
$\|S\| = (\tfrac{1}{27} \pm O(\sqrt{1/n}\, \log n))\, \tau^2$.

Let $C$ refer to Expression 5.6 so that Expression 5.7 is $C^T$. Let also $A'' = (P_Y \otimes P_Z) A' P_X$.
Note that, once the identically-zero rows and columns of $A''$ are removed, $A''$ is a matrix
of iid standard Gaussian entries. Finally, let $v'' = P_Y R v_0 \otimes P_Z R v_0$. By some substitution
and by noting that $\|P_X R\| \le 1$, we have that $\|C\| \le \tau \|v_0 v''^T A''\|$. Hence by Lemma B.10,
$\|C\| \le O(\varepsilon \tau^2)$.

Let $N$ refer to Expression 5.8. Note that $N = A''^T A''$. Therefore by Lemma 5.10,
$\|N - n^2/9 \cdot \mathrm{Id}\| \le O(n^{3/2})$.

Thus $M_X = S + C + C^T + (N - n^2/9 \cdot \mathrm{Id})$, so that $\|M_X - S\| \le O(\varepsilon \tau^2)$. Since $S$ is rank-one and
has $\|S\| \ge \Omega(\tau^2)$, we conclude that matrix power iteration converges in $O(-\log \varepsilon)$ steps.

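The recovered eigenvector $v_X$ satisfies $\langle v_X, M_X v_X \rangle \ge \Omega(\tau^2)$ and $\langle v_X, (M_X - S) v_X \rangle \le O(\varepsilon \tau^2)$
and therefore $\langle v_X, S v_X \rangle = (\tfrac{1}{27} \pm O(\varepsilon + \sqrt{1/n}\, \log n))\, \tau^2$. Substituting in the expression for $S$,
we conclude that $\langle P_X R v_0, v_X \rangle = (\tfrac{1}{\sqrt{3}} \pm O(\varepsilon + \sqrt{1/n}\, \log n))$.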

The analyses for $v_Y$ and $v_Z$ follow in the same way. Hence

$\langle v_X + v_Y + v_Z, R v_0 \rangle = \langle v_X, P_X R v_0 \rangle + \langle v_Y, P_Y R v_0 \rangle + \langle v_Z, P_Z R v_0 \rangle$

$\quad \ge \sqrt{3} - O(\varepsilon + \sqrt{1/n}\, \log n)$.

At the same time, since $v_X$, $v_Y$, and $v_Z$ are each orthogonal to each other, $\|v_X + v_Y + v_Z\| = \sqrt{3}$.
Hence with the output vector being $v := R^{-1}(v_X + v_Y + v_Z)/\|v_X + v_Y + v_Z\|$, we have

$\langle v, v_0 \rangle = \langle Rv, Rv_0 \rangle = \tfrac{1}{\sqrt{3}} \langle v_X + v_Y + v_Z, R v_0 \rangle \ge 1 - O(\varepsilon + \sqrt{1/n}\, \log n)$.

□ 
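For readers who want to experiment, the following is a rough sketch of Algorithm 5.14 in NumPy, not the authors' implementation (their experiments used Julia). Several choices are assumptions made here for illustration: the partition is a balanced random one, the orthogonal matrix comes from a QR factorization of a Gaussian matrix, the power method runs for a fixed number of iterations, and the sign of each block eigenvector is fixed by making the cubic form of the corresponding diagonal block positive (the text does not discuss sign alignment). All function names are hypothetical, and the signal strength in the usage lines is chosen far above the threshold so that the outcome is not delicate.

```python
import numpy as np

def symmetric_gaussian_tensor(n, rng):
    """Average of B^pi over the six index permutations of an iid Gaussian 3-tensor B."""
    B = rng.standard_normal((n, n, n))
    perms = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
    return sum(np.transpose(B, p) for p in perms) / 6.0

def power_method(M, iters, rng):
    """Plain power iteration from a random start; returns a unit vector."""
    x = rng.standard_normal(M.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    return x

def recover(T, rng, iters=100):
    n = T.shape[0]
    R, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal change of basis
    # U = R^{(x)3} T, applied one mode at a time (O(n^4) work in this dense sketch).
    U = np.einsum('ai,ijk->ajk', R, T)
    U = np.einsum('bj,ajk->abk', R, U)
    U = np.einsum('ck,abk->abc', R, U)
    labels = rng.permutation(n) % 3                     # balanced random partition (assumption)
    parts = [np.flatnonzero(labels == k) for k in range(3)]
    total = np.zeros(n)
    for k in range(3):
        X, Y, Z = parts[k], parts[(k + 1) % 3], parts[(k + 2) % 3]
        # Nonzero block of (P_Y (x) P_Z) U P_X as a (|Y||Z|)-by-|X| matrix.
        block = U[np.ix_(Y, Z, X)].reshape(len(Y) * len(Z), len(X))
        M = block.T @ block - (n**2 / 9.0) * np.eye(len(X))
        vx = power_method(M, iters, rng)
        # Sign fix (assumption, not from the text): make the cubic form on X x X x X positive.
        if np.einsum('abc,a,b,c->', U[np.ix_(X, X, X)], vx, vx, vx) < 0:
            vx = -vx
        total[X] += vx
    v = R.T @ total                                     # R^{-1} = R^T for orthogonal R
    return v / np.linalg.norm(v)

rng = np.random.default_rng(2)
n = 45
tau = 50 * n**0.75                                      # well above the n^{3/4} scale
v0 = rng.standard_normal(n)
v0 /= np.linalg.norm(v0)
T = tau * np.einsum('i,j,k->ijk', v0, v0, v0) + symmetric_gaussian_tensor(n, rng)
v = recover(T, rng)
corr = float(v @ v0)
print(corr)
```

This sketch forms the rotated tensor explicitly for clarity; the nearly-linear running time discussed in Remark 5.15 instead relies on applying $R$, $T$, and the projectors implicitly inside each matrix-vector multiply.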


5.6 Numerical Simulations 



[Figure 1 plots not reproduced; surviving axis labels: "matrix-vector multiplies", "instance size".]

Figure 1: Numerical simulation of Algorithm 5.1 ("Nearly-optimal spectral SoS" implemented with matrix
power method), and two implementations of Algorithm 5.7 ("Accelerated power method"/"Nearly-linear
tensor unfolding" and "Naive power method"/"Naive tensor unfolding"). Simulations were run in Julia on
a Dell Optiplex 7010 running Ubuntu 12.04 with two Intel Core i7 3770 processors at 3.40 GHz and 16GB
of RAM. Plots created with Gadfly. Error bars denote 95% confidence intervals. Matrix-vector multiply
experiments were conducted with n = 200. Reported matrix-vector multiply counts are the average of 50
independent trials. Reported times are in cpu-seconds and are the average of 10 independent trials. Note
that both axes in the right-hand plot are log scaled.


We report now the results of some basic numerical simulations of the algorithms from
this section. In particular, we show that the asymptotic running time differences among
Algorithm 5.1, Algorithm 5.7 implemented naively, and the linear-time implementation of
Algorithm 5.7 are apparent at reasonable values of $n$, e.g. $n = 200$.

Specifics of our experiments are given in Figure 1. We find pronounced differences
between all three algorithms. The naive implementation of Algorithm 5.7 is markedly
slower than the linear implementation, as measured either by number of matrix-vector
multiplies or processor time. Algorithm 5.1 suffers greatly from the need to construct
an $n^2 \times n^2$ matrix; although we do not count the time to construct this matrix against its
reported running time, the memory requirements are so punishing that we were unable to
collect data beyond $n = 100$ for this algorithm.

6 Lower Bounds 

We will now prove lower bounds on the performance of degree-4 SoS on random instances
of the degree-4 and degree-3 homogeneous polynomial maximization problems. As an
application, we show that our analysis of degree-4 SoS for Tensor PCA is tight up to a small
logarithmic factor in the signal-to-noise ratio.

Theorem 6.1 (Part one of formal version of Theorem 1.5). There is $\tau = \Omega(n)$ and a function
$\eta : A \mapsto \{x\}$ mapping 4-tensors to degree-4 pseudo-distributions satisfying $\{\|x\|^2 = 1\}$ so that for
every unit vector $v_0$, if $A$ has unit Gaussian entries, then, with high probability over random choice
of $A$, the pseudo-expectation $\tilde{\mathbb{E}}_{x \sim \eta(A)}\, \tau \cdot \langle v_0, x \rangle^4 + A(x)$ is maximal up to constant factors among
$\tilde{\mathbb{E}}\, \tau \cdot \langle v_0, y \rangle^4 + A(y)$ over all degree-4 pseudo-distributions $\{y\}$ satisfying $\{\|y\|^2 = 1\}$.

Theorem 6.2 (Part two of formal version of Theorem 1.5). There is $\tau = \Omega(n^{3/4}/(\log n)^{1/4})$ and
a function $\eta : A \mapsto \{x\}$ mapping 3-tensors to degree-4 pseudo-distributions satisfying $\{\|x\|^2 = 1\}$
so that for every unit vector $v_0$, if $A$ has unit Gaussian entries, then, with high probability over
random choice of $A$, the pseudo-expectation $\tilde{\mathbb{E}}_{x \sim \eta(A)}\, \tau \cdot \langle v_0, x \rangle^3 + A(x)$ is maximal up to logarithmic
factors among $\tilde{\mathbb{E}}\, \tau \cdot \langle v_0, y \rangle^3 + A(y)$ over all degree-4 pseudo-distributions $\{y\}$ satisfying $\{\|y\|^2 = 1\}$.

The existence of the maps $\eta$ depending only on the random part $A$ of the tensor PCA
input $\tau v_0^{\otimes 3} + A$ formalizes the claim from Theorem 1.5 that no algorithm can reliably recover
$v_0$ from the pseudo-distribution $\eta(A)$.

Additionally, the lower-bound construction holds for the symmetric noise model also: 
the input tensor A is symmetrized wherever it occurs in the construction, so it does not 
matter if it had already been symmetrized beforehand. 

The rest of this section is devoted to proving these theorems, which we eventually 
accomplish in Section 6.2. 

6.0.1 Discussion and Outline of Proof 

Given a random 3-tensor $A$, we will take the degree-3 pseudo-moments of our $\eta(A)$ to be
$\varepsilon A$, for some small $\varepsilon$, so that $\tilde{\mathbb{E}}_{x \sim \eta(A)} A(x)$ is large. The main question is how to give degree-4
pseudo-moments to go with this. We will construct these from $AA^T$ and its permutations
as a 4-tensor under the action of $S_4$.

We have already seen that a spectral upper bound on one of these permutations, $\sum_i A_i \otimes A_i$,
provides a performance guarantee for degree-4 SoS optimization of degree-3 polynomials.
It is not a coincidence that this SoS lower bound depends on the negative eigenvalues
of the permutations of $AA^T$. Running the argument for the upper bound in reverse,
a pseudo-distribution $\{x\}$ satisfying $\{\|x\|^2 = 1\}$ and with $\tilde{\mathbb{E}} A(x)$ large must (by
pseudo-Cauchy-Schwarz) also have $\tilde{\mathbb{E}} \langle x^{\otimes 2}, (\sum_i A_i \otimes A_i)\, x^{\otimes 2} \rangle$ large. The permutations of $AA^T$ are all
matrix representations of that same polynomial, $\langle x^{\otimes 2}, (\sum_i A_i \otimes A_i)\, x^{\otimes 2} \rangle$. Hence $\tilde{\mathbb{E}} A(x)$ will
be large only if the matrix representation of the pseudo-distribution $\{x\}$ is well correlated
with the permutations of $AA^T$. Since this matrix representation will also need to be
positive-semidefinite, control on the spectra of permutations of $AA^T$ is therefore the key to
our approach.

The general outline of the proof will be as follows: 

1. Construct a pseudo-distribution that is well correlated with the permutations of $AA^T$
and gives a large value to $\tilde{\mathbb{E}} A(x)$, but which is not on the unit sphere.

2. Use a procedure modifying the first and second degree moments of the pseudo-distribution
to force it onto a sphere, at the cost of violating the condition that
$\tilde{\mathbb{E}}\, p(x)^2 \ge 0$ for all $p \in \mathbb{R}[x]_{\le 2}$; then rescale so it lives on the unit sphere. Thus, we end
up with an object that is no longer a valid pseudo-distribution but a more general
linear functional $\mathcal{L}$ on polynomials.

3. Quantitatively bound the failure of $\mathcal{L}$ to be a pseudo-distribution, and repair it
by statistically mixing the almost-pseudo-distribution with a small amount of the
uniform distribution over the sphere. Show that $\tilde{\mathbb{E}} A(x)$ is still large for this new
pseudo-distribution over the unit sphere.

But before we can state a formal version of our theorem, we will need a few facts 
about polynomials, pseudo-distributions, matrices, vectors, and how they are related by 
symmetries under actions of permutation groups. 

6.1 Polynomials, Vectors, Matrices, and Symmetries, Redux 

Here we further develop the matrix view of SoS presented in Section 5.1.1. 

We will need to use general linear functionals $\mathcal{L} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ on polynomials as an
intermediate step between matrices and pseudo-distributions. Like pseudo-distributions,
each such linear functional $\mathcal{L}$ has a unique matrix representation $M_{\mathcal{L}}$ satisfying certain
maximal symmetry constraints. The matrix $M_{\mathcal{L}}$ is positive-semidefinite if and only if
$\mathcal{L}\, p(x)^2 \ge 0$ for every $p$. If $\mathcal{L}$ satisfies this and $\mathcal{L} 1 = 1$, then $\mathcal{L}$ is a pseudo-expectation, and
$M_{\mathcal{L}}$ is the matrix representation of the corresponding pseudo-distribution.

6.1.1 Matrices for Linear Functionals and Maximal Symmetry 

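Let $\mathcal{L} : \mathbb{R}[x]_{\le d} \to \mathbb{R}$. Then $\mathcal{L}$ can be represented as an $n^{\#\mathrm{tuples}(d)} \times n^{\#\mathrm{tuples}(d)}$ matrix indexed by all
$d'$-tuples over $[n]$ with $d' \le d/2$. For tuples $\alpha, \beta$, this matrix $M_{\mathcal{L}}$ is given by

$M_{\mathcal{L}}[\alpha, \beta] = \mathcal{L}\, x^\alpha x^\beta$.

For a linear functional $\mathcal{L} : \mathbb{R}[x]_{\le d} \to \mathbb{R}$, a polynomial $p(x) \in \mathbb{R}[x]_{\le d}$, and a matrix
representation $M_p$ for $p$ we thus have $\langle M_{\mathcal{L}}, M_p \rangle = \mathcal{L}\, p(x)$.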

A polynomial in $\mathbb{R}[x]_{\le d}$ may have many matrix representations, while for us, a linear
functional $\mathcal{L}$ has just one: the matrix $M_{\mathcal{L}}$. This is because in our definition we have required
that $M_{\mathcal{L}}$ obey the constraints

$M_{\mathcal{L}}[\alpha, \beta] = M_{\mathcal{L}}[\alpha', \beta']$ when $x^\alpha x^\beta = x^{\alpha'} x^{\beta'}$,

in order that they assign consistent values to each representation of the same polynomial.
We call such matrices maximally symmetric (following Doherty and Wehner [DW12]).
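The relationship between a functional, its maximally symmetric matrix, and the several matrix representations of one polynomial can be made concrete. The sketch below is an illustration under simplifying assumptions: the functional is an ordinary expectation over a hypothetical finite sample `pts` (so it is a genuine distribution rather than a pseudo-distribution), and only the homogeneous degree-4 block of the matrix is built. It uses NumPy and the variable names are chosen here, not taken from the text.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
n = 3
pts = rng.standard_normal((50, n))     # support of a hypothetical empirical distribution

def functional(poly):
    """L p = average of p over the sample points."""
    return float(np.mean([poly(x) for x in pts]))

# Degree-4 block: M_L[(i,j),(k,l)] = L x_i x_j x_k x_l.
M_L = np.mean([np.outer(np.kron(x, x), np.kron(x, x)) for x in pts], axis=0)

# Two different matrix representations of the same polynomial p(x) = x_0^2 x_1^2.
M_p1 = np.zeros((n * n, n * n))
M_p1[0 * n + 0, 1 * n + 1] = 1.0       # pairs (x_0 x_0) with (x_1 x_1)
M_p2 = np.zeros((n * n, n * n))
M_p2[0 * n + 1, 0 * n + 1] = 1.0       # pairs (x_0 x_1) with (x_0 x_1)

val_direct = functional(lambda x: x[0] ** 2 * x[1] ** 2)
val_rep1 = float(np.sum(M_L * M_p1))   # <M_L, M_p1>
val_rep2 = float(np.sum(M_L * M_p2))   # <M_L, M_p2>

# Maximal symmetry: the 4-tensor behind M_L is invariant under every index permutation.
T4 = M_L.reshape(n, n, n, n)
is_max_symmetric = all(np.allclose(T4, T4.transpose(p)) for p in permutations(range(4)))
print(val_direct, val_rep1, val_rep2, is_max_symmetric)
```

Both representations give the same inner product precisely because $M_{\mathcal{L}}$ satisfies the symmetry constraints; a matrix without them could assign different values to `M_p1` and `M_p2`.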

We have particular interest in the maximally-symmetric version of the identity matrix.
The degree-$d$ symmetrized identity matrix $\mathrm{Id}^{\mathrm{sym}}$ is the unique maximally symmetric matrix
so that

$\langle x^{\otimes d/2}, \mathrm{Id}^{\mathrm{sym}}\, x^{\otimes d/2} \rangle = \|x\|_2^d$. (6.1)

The degree $d$ will always be clear from context.

In addition to being a matrix representation of the polynomial $\|x\|_2^d$, the maximally
symmetric matrix $\mathrm{Id}^{\mathrm{sym}}$ also serves a dual purpose as a linear functional. We will often be
concerned with the expectation operator $\mathbb{E}^\mu$ for the uniform distribution over the $n$-sphere,
and indeed for every polynomial $p(x)$ with matrix representation $M_p$,

$\mathbb{E}^\mu p(x) = \frac{1}{n^2 + 2n} \langle \mathrm{Id}^{\mathrm{sym}}, M_p \rangle$,

and so $\mathrm{Id}^{\mathrm{sym}}/(n^2 + 2n)$ is the unique matrix representation of $\mathbb{E}^\mu$.

6.1.2 The Monomial-Indexed (i.e. Symmetric) Subspace 

We will also require vector representations of polynomials. We note that $\mathbb{R}[x]_{\le d/2}$ has
a canonical embedding into $\mathbb{R}^{\#\mathrm{tuples}(d)}$ as the subspace given by the following family of
constraints, expressed in the basis of $d'$-tuples for $d' \le d/2$:

$\mathbb{R}[x]_{\le d/2} \simeq \{ p \in \mathbb{R}^{\#\mathrm{tuples}(d)} \text{ such that } p_\alpha = p_{\alpha'} \text{ if } \alpha' \text{ is a permutation of } \alpha \}$.

We let $\Pi$ be the projector to this subspace. For any maximally-symmetric $M$ we have
$\Pi M \Pi = M$, but the reverse implication is not true (for readers familiar with quantum
information: any $M$ which has $M = \Pi M \Pi$ is Bose-symmetric, but may not be PPT-symmetric;
maximally symmetric matrices are both. See [DW12] for further discussion).

If we restrict attention to the embedding this induces of $\mathbb{R}[x]_{d/2}$ (i.e. the homogeneous
degree-$d/2$ polynomials) into $\mathbb{R}^{n^{d/2}}$, the resulting subspace is sometimes called the symmetric
subspace and in other works is denoted by $\vee^{d/2} \mathbb{R}^n$. We sometimes abuse notation and let $\Pi$
be the projector from $\mathbb{R}^{n^{d/2}}$ to the canonical embedding of $\mathbb{R}[x]_{d/2}$.

6.1.3 Maximally-Symmetric Matrices from Tensors 

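The group $S_d$ acts on the set of $d$-tensors (canonically flattened to matrices in $\mathbb{R}^{n^{\lfloor d/2 \rfloor} \times n^{\lceil d/2 \rceil}}$) by
permutation of indices. To any such flattened $M \in \mathbb{R}^{n^{\lfloor d/2 \rfloor} \times n^{\lceil d/2 \rceil}}$, we associate a family of
maximally-symmetric matrices $\mathrm{Sym}\, M$ given by

$\mathrm{Sym}\, M = \left\{ t \sum_{\pi \in S_d} \pi \cdot M \text{ for all } t \ge 0 \right\}$.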

That is, $\mathrm{Sym}\, M$ represents all scaled averages of $M$ over different possible flattenings of its
corresponding $d$-tensor. The following conditions on a matrix $M$ are thus equivalent: (1)
$M \in \mathrm{Sym}\, M$, (2) $M$ is maximally symmetric, (3) a tensor that flattens to $M$ is invariant under
the index-permutation action of $S_d$, and (4) $M$ may be considered as a linear functional on
the space of homogeneous polynomials $\mathbb{R}[x]_d$. When we construct maximally-symmetric
matrices from un-symmetric ones, the choice of $t$ is somewhat subtle and will be important
in not being too wasteful in intermediate steps of our construction.

There is a more complex group action characterizing maximally-symmetric matrices in
$\mathbb{R}^{\#\mathrm{tuples}(d) \times \#\mathrm{tuples}(d)}$, which projects to the action of $S_{d'}$ under the projection of $\mathbb{R}^{\#\mathrm{tuples}(d) \times \#\mathrm{tuples}(d)}$
to $\mathbb{R}^{n^{d'/2} \times n^{d'/2}}$. We will never have to work explicitly with this full symmetry group; instead
we will be able to construct linear functionals on $\mathbb{R}[x]_{\le d}$ (i.e. maximally symmetric matrices
in $\mathbb{R}^{\#\mathrm{tuples}(d) \times \#\mathrm{tuples}(d)}$) by symmetrizing each degree (i.e. each $d' \le d$) more or less separately.

6.2 Formal Statement of the Lower Bound 

We will warm up with the degree-4 lower bound, which is conceptually somewhat simpler. 

Theorem 6.3 (Degree-4 Lower Bound, General Version). Let $A$ be a 4-tensor and let $\lambda > 0$ be
a function of $n$. Suppose the following conditions hold:

— $A$ is significantly correlated with $\sum_{\pi \in S_4} A^\pi$.

$\langle A, \sum_{\pi \in S_4} A^\pi \rangle \ge \Omega(n^4)$.

— Permutations have lower-bounded spectrum.

For every $\pi \in S_4$, the Hermitian $n^2 \times n^2$ unfolding $\tfrac{1}{2}(A^\pi + (A^\pi)^T)$ of $A^\pi$ has no eigenvalues
smaller than $-\lambda^2$.

— Using $A$ as 4th pseudo-moments does not imply that $\|x\|^4$ is too large.

For every $\pi \in S_4$, we have $\langle \mathrm{Id}^{\mathrm{sym}}, A^\pi \rangle \le O(\lambda^2 n^{3/2})$.

— Using $A$ for 4th pseudo-moments does not imply first and second degree moments
are too large.

Let $\mathcal{L} : \mathbb{R}[x]_4 \to \mathbb{R}$ be the linear functional given by the matrix representation
$M_{\mathcal{L}} := \frac{1}{\lambda^2 n^2} \sum_{\pi \in S_4} A^\pi$. Let

$\delta_2 \overset{\mathrm{def}}{=} \max_{i \ne j} \left| \mathcal{L}\, \|x\|_2^2 x_i x_j \right|$
$\delta_2' \overset{\mathrm{def}}{=} \max_i \left| \mathcal{L}\, \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal{L}\, \|x\|_2^4 \right|$

Then $n^{3/2} \delta_2' + n^2 \delta_2 \le O(1)$.

Then there is a degree-4 pseudo-distribution $\{x\}$ satisfying $\{\|x\|^2 = 1\}$ so that $\tilde{\mathbb{E}} A(x) \ge \Omega(n^2/\lambda^2) + \Theta(\mathbb{E}^\mu A(x))$.

The degree-3 version of our lower bound requires bounds on the spectra of the
flattenings not just of the 3-tensor $A$ itself but also of the flattenings of an associated
4-tensor, which represents the polynomial $\langle x^{\otimes 2}, (\sum_i A_i \otimes A_i)\, x^{\otimes 2} \rangle$.

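Theorem 6.4 (Degree-3 Lower Bound, General Version). Let $A$ be a 3-tensor and let $\lambda > 0$ be
a function of $n$. Suppose the following conditions hold:

— $A$ is significantly correlated with $\sum_{\pi \in S_3} A^\pi$.

$\langle A, \sum_{\pi \in S_3} A^\pi \rangle \ge \Omega(n^3)$.

— Permutations have lower-bounded spectrum.

For every $\pi \in S_3$, we have

$-2\lambda^2 \cdot \Pi\, \mathrm{Id}\, \Pi \preceq \tfrac{1}{2} \Pi \left( \sigma \cdot A^\pi (A^\pi)^T + \sigma^2 \cdot A^\pi (A^\pi)^T \right) \Pi + \tfrac{1}{2} \Pi \left( \sigma \cdot A^\pi (A^\pi)^T + \sigma^2 \cdot A^\pi (A^\pi)^T \right)^T \Pi$.

— Using $AA^T$ for 4th moments does not imply $\|x\|^4$ is too large.

For every $\pi \in S_3$, we have $\langle \mathrm{Id}^{\mathrm{sym}}, A^\pi (A^\pi)^T \rangle \le O(\lambda^2 n^2)$.

— Using $A$ and $AA^T$ for 3rd and 4th moments do not imply first and second degree
moments are too large.

Let $\pi \in S_3$. Let $\mathcal{L} : \mathbb{R}[x]_4 \to \mathbb{R}$ be the linear functional given by the matrix representation
$M_{\mathcal{L}} := \frac{1}{\lambda^2 n^2} \sum_{\pi' \in S_4} \pi' \cdot AA^T$. Let

$\delta_1 \overset{\mathrm{def}}{=} \max_i \left| \ldots \right|$
$\delta_2 \overset{\mathrm{def}}{=} \max_{i \ne j} \left| \mathcal{L}\, \|x\|_2^2 x_i x_j \right|$
$\delta_2' \overset{\mathrm{def}}{=} \max_i \left| \mathcal{L}\, \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal{L}\, \|x\|_2^4 \right|$

Then $n \delta_1 + n^{3/2} \delta_2' + n^2 \delta_2 \le O(1)$.

Then there is a degree-4 pseudo-distribution $\{x\}$ satisfying $\{\|x\|^2 = 1\}$ so that

$\tilde{\mathbb{E}} A(x) \ge \Omega\left( \frac{n^{3/2}}{\lambda} \right) + \Theta(\mathbb{E}^\mu A(x))$.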


6.2.1 Proof of Theorem 6.2 

We prove the degree-3 corollary; the degree-4 case is almost identical using Theorem 6.3 
and Lemma B.12 in place of their degree-3 counterparts. 

Proof. Let $A$ be a 3-tensor. If $A$ satisfies the conditions of Theorem 6.4 with $\lambda = O(n^{3/4} \log(n)^{1/4})$,
we let $\eta(A)$ be the pseudo-distribution described there, with

$\tilde{\mathbb{E}}_{x \sim \eta(A)} A(x) \ge \Omega\left( \frac{n^{3/2}}{\lambda} \right) + \Theta(\mathbb{E}^\mu A(x))$.

If $A$ does not satisfy the regularity conditions, we let $\eta(A)$ be the uniform distribution on
the unit sphere. If $A$ has unit Gaussian entries, then Lemma B.11 says that the regularity
conditions are satisfied with this choice of $\lambda$ with high probability. The operator norm of $A$
is at most $O(\sqrt{n})$, so $\mathbb{E}^\mu A(x) = O(\sqrt{n})$ (all with high probability) [TS14]. We have chosen
$\lambda$ and $\tau$ so that when the conditions of Theorem 6.4 and the bound on $\mathbb{E}^\mu A(x)$ obtain,

$\tilde{\mathbb{E}}_{x \sim \eta(A)}\, \tau \cdot \langle v_0, x \rangle^3 + A(x) \ge \Omega\left( \frac{n^{3/4}}{\log(n)^{1/4}} \right)$.

On the other hand, our arguments on degree-4 SoS certificates for random polynomials
say with high probability every degree-4 pseudo-distribution $\{y\}$ satisfying $\{\|y\|^2 = 1\}$ has
$\tilde{\mathbb{E}}\, \tau \cdot \langle v_0, y \rangle^3 + A(y) \le O(n^{3/4} \log(n)^{1/4})$. Thus, $\{x\}$ is nearly optimal and we are done. □

6.3 In-depth Preliminaries for Pseudo-Expectation Symmetries 

This section gives the preliminaries we will need to construct maximally-symmetric
matrices (a.k.a. functionals $\mathcal{L} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$) in what follows. For a non-maximally-symmetric
$M \in \mathbb{R}^{n^2 \times n^2}$ under the action of $S_4$ by permutation of indices, the subgroup
$C_3 < S_4$ represents all the significant permutations whose spectra may differ from one
another in a nontrivial way. The lemmas that follow will make this more precise. For
concreteness, we take $C_3 = \langle \sigma \rangle$ with $\sigma = (234)$, but any other choice of 3-cycle would lead
to a merely syntactic change in the proof.

Lemma 6.5. Let $D_8 < S_4$ be given by $D_8 = \langle (12), (34), (13)(24) \rangle$. Let $C_3 = \{(), \sigma, \sigma^2\} = \langle \sigma \rangle$,
where $()$ denotes the identity in $S_4$. Then $\{gh : g \in D_8, h \in C_3\} = S_4$.

Proof. The proof is routine; we provide it here for completeness. Note that $C_3$ is a subgroup
of order 3 in the alternating group $A_4$. This alternating group can be decomposed as
$A_4 = K_4 \cdot C_3$, where $K_4 = \langle (12)(34), (13)(24) \rangle$ is a normal subgroup of $A_4$. We can also
decompose $S_4 = C_2 \cdot A_4$ where $C_2 = \langle (12) \rangle$ and $A_4$ is a normal subgroup of $S_4$. Finally,
$D_8 = C_2 \cdot K_4$ so by associativity, $S_4 = C_2 \cdot A_4 = C_2 \cdot K_4 \cdot C_3 = D_8 \cdot C_3$. □
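Lemma 6.5 is small enough to confirm by brute-force enumeration. The sketch below uses only the Python standard library; permutations of $\{1,2,3,4\}$ are stored 0-indexed as tuples, and the helper names `compose` and `generate` are illustrative choices made here.

```python
from itertools import permutations

def compose(p, q):
    """(p o q)(i) = p[q[i]] for permutations stored as tuples on {0,1,2,3}."""
    return tuple(p[q[i]] for i in range(4))

def generate(gens):
    """Subgroup of S_4 generated by gens, found by closing under composition."""
    identity = (0, 1, 2, 3)
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in gens:
            h = compose(g, s)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group

swap_12 = (1, 0, 2, 3)          # the transposition (12)
swap_34 = (0, 1, 3, 2)          # the transposition (34)
swap_pairs = (2, 3, 0, 1)       # the double transposition (13)(24)
sigma = (0, 2, 3, 1)            # the 3-cycle (234): 2 -> 3 -> 4 -> 2

D8 = generate([swap_12, swap_34, swap_pairs])
C3 = generate([sigma])
products = {compose(g, h) for g in D8 for h in C3}
S4 = set(permutations(range(4)))
print(len(D8), len(C3), len(products))
```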

This lemma has two useful corollaries: 

Corollary 6.6. For any nonempty subset $S \subseteq S_4$, we have $\{ghs : g \in D_8, h \in C_3, s \in S\} = S_4$.

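Corollary 6.7. Let $M \in \mathbb{R}^{n^2 \times n^2}$. Let the matrix $M'$ be given by

$M' := \tfrac{1}{2} \Pi \left( M + \sigma \cdot M + \sigma^2 \cdot M \right) \Pi + \tfrac{1}{2} \Pi \left( M + \sigma \cdot M + \sigma^2 \cdot M \right)^T \Pi$.

Then $M' \in \mathrm{Sym}\, M$.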

Proof. Observe first that $M + \sigma \cdot M + \sigma^2 \cdot M = \sum_{\pi \in C_3} \pi \cdot M$. For arbitrary $N \in \mathbb{R}^{n^2 \times n^2}$, we show
that $\tfrac{1}{2} \Pi N \Pi + \tfrac{1}{2} \Pi N^T \Pi = \tfrac{1}{8} \sum_{\pi \in D_8} \pi \cdot N$. First, conjugation by $\Pi$ corresponds to averaging
$N$ over the group $\langle (12), (34) \rangle$ generated by interchange of indices in row and column
indexing pairs, individually. At the same time, $\tfrac{1}{2}(N + N^T)$ is the average of $N$ over the matrix
transposition permutation group $\langle (13)(24) \rangle$. All together,

$M' = \tfrac{1}{8} \sum_{g \in D_8} \sum_{h \in C_3} (gh) \cdot M = \tfrac{1}{8} \sum_{\pi \in S_4} \pi \cdot M$,

and so $M' \in \mathrm{Sym}\, M$. □
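The identity behind Corollary 6.7 can be tested on a random example. In the sketch below (NumPy assumed), a generic 4-tensor `T` stands in for the tensor whose $n^2 \times n^2$ flattening is $M$, the two nontrivial elements of $C_3$ are realized as the cyclic rotations of the last three axes, and $\Pi$ is built as the average of the identity and the swap matrix. Which rotation is called $\sigma$ and which $\sigma^2$ depends on a convention that does not affect the sum, so none is asserted here.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(4)
n = 3
T = rng.standard_normal((n, n, n, n))      # generic 4-tensor; M is its flattening

def flat(t):
    return t.reshape(n * n, n * n)

# Swap matrix: (swap x)[(a,b)] = x[(b,a)]; Pi projects onto the symmetric subspace.
swap = np.eye(n * n).reshape(n, n, n, n).transpose(1, 0, 2, 3).reshape(n * n, n * n)
Pi = 0.5 * (np.eye(n * n) + swap)

# Sum over C_3: identity plus the two cyclic rotations of axes (1, 2, 3).
N = flat(T + T.transpose(0, 2, 3, 1) + T.transpose(0, 3, 1, 2))
M_prime = 0.5 * Pi @ N @ Pi + 0.5 * Pi @ N.T @ Pi

# Right-hand side: one eighth of the sum over all 24 index permutations.
full = flat(sum(T.transpose(p) for p in permutations(range(4))) / 8.0)
gap = float(np.max(np.abs(M_prime - full)))
print(gap)
```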

We make a useful observation about the nontrivial permutations of $M$, in the special
case that $M = AA^T$ for some 3-tensor $A$.


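Lemma 6.8. Let $A$ be a 3-tensor and let $A \in \mathbb{R}^{n^2 \times n}$ be its flattening, where the second and third modes
lie on the longer axis and the first mode lies on the shorter axis. Let $A_i$ be the $n \times n$ matrix slices of
$A$ along the first mode, so that the $i$-th column of the flattening is the flattening of $A_i$.
Let $P : \mathbb{R}^{n^2} \to \mathbb{R}^{n^2}$ be the orthogonal linear operator so that $[Px](i, j) = x(j, i)$. Then

$\sigma \cdot AA^T = \left( \sum_i A_i \otimes A_i \right) P$

and

$\sigma^2 \cdot AA^T = \sum_i A_i \otimes A_i$.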


Proof. We observe that $AA^T[(j_1, j_2), (j_3, j_4)] = \sum_i A_{i j_1 j_2} A_{i j_3 j_4}$ and that $(\sum_i A_i \otimes
A_i)[(j_1, j_2), (j_3, j_4)] = \sum_i A_{i j_1 j_3} A_{i j_2 j_4}$. Multiplication by $P$ on the right has the effect of switching
the order of the second indexing pair, so $[(\sum_i A_i \otimes A_i) P][(j_1, j_2), (j_3, j_4)] = \sum_i A_{i j_1 j_4} A_{i j_2 j_3}$. From
this it is easy to see that $\sigma \cdot AA^T = (234) \cdot AA^T = (\sum_i A_i \otimes A_i) P$.

Similarly, we have that

$(\sigma^2 \cdot AA^T)[(j_1, j_2), (j_3, j_4)] = ((243) \cdot AA^T)[(j_1, j_2), (j_3, j_4)] = \sum_i A_{i j_1 j_3} A_{i j_2 j_4}$,

from which we see that $\sigma^2 \cdot AA^T = \sum_i A_i \otimes A_i$. □

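Permutations of the Identity Matrix. The nontrivial permutations of $\mathrm{Id}_{n^2 \times n^2}$ are:

$\mathrm{Id}[(j,k),(j',k')] = \delta(j,j')\,\delta(k,k')$
$\sigma \cdot \mathrm{Id}[(j,k),(j',k')] = \delta(j,k)\,\delta(j',k')$
$\sigma^2 \cdot \mathrm{Id}[(j,k),(j',k')] = \delta(j,k')\,\delta(j',k)$.

Since $(\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id})$ is invariant under the action of $D_8$, we have $(\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id}) \in \mathrm{Sym}\, \mathrm{Id}$;
up to scaling this matrix is the same as $\mathrm{Id}^{\mathrm{sym}}$ defined in (6.1). We record the following
observations:

— $\mathrm{Id}$, $\sigma \cdot \mathrm{Id}$, and $\sigma^2 \cdot \mathrm{Id}$ are all symmetric matrices.

— Up to scaling, $\mathrm{Id} + \sigma^2 \cdot \mathrm{Id}$ projects to identity on the canonical embedding of $\mathbb{R}[x]_2$.

— The matrix $\sigma \cdot \mathrm{Id}$ is rank-1, positive-semidefinite, and has $\Pi (\sigma \cdot \mathrm{Id}) \Pi = \sigma \cdot \mathrm{Id}$.

— The scaling $[1/(n^2 + 2n)](\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id})$ is equal to a linear functional $\mathbb{E}^\mu : \mathbb{R}[x]_4 \to \mathbb{R}$
giving the expectation under the uniform distribution over the unit sphere $S^{n-1}$.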
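The listed observations are easy to check for a small dimension. The sketch below (NumPy assumed) builds the three matrices from their Kronecker-delta descriptions, taking $\sigma \cdot \mathrm{Id}$ to be the rank-one matrix $\delta(j,k)\,\delta(j',k')$ and $\sigma^2 \cdot \mathrm{Id}$ to be the swap matrix $\delta(j,k')\,\delta(j',k)$; the variable names are illustrative. It also confirms that the sum of the three has quadratic form $3\|x\|^4$ on $x \otimes x$, which is the "up to scaling" relation with (6.1), and that the $1/(n^2+2n)$ scaling reproduces the standard sphere moments $\mathbb{E}\, x_0^4 = 3/(n^2+2n)$ and $\mathbb{E}\, x_0^2 x_1^2 = 1/(n^2+2n)$.

```python
import numpy as np

n = 4
Id = np.eye(n * n)
w = np.eye(n).reshape(n * n)                      # flattening of the n-by-n identity
sId = np.outer(w, w)                              # delta(j,k) delta(j',k')
s2Id = Id.reshape(n, n, n, n).transpose(0, 1, 3, 2).reshape(n * n, n * n)  # swap matrix
Pi = 0.5 * (Id + s2Id)                            # projector onto the symmetric subspace

all_symmetric = all(np.allclose(A, A.T) for A in (Id, sId, s2Id))
rank_sId = int(np.linalg.matrix_rank(sId))
min_eig_sId = float(np.linalg.eigvalsh(sId).min())
fixed_by_Pi = bool(np.allclose(Pi @ sId @ Pi, sId))
is_projector = bool(np.allclose(Pi @ Pi, Pi))

x = np.random.default_rng(5).standard_normal(n)
xx = np.kron(x, x)
quad = float(xx @ (Id + sId + s2Id) @ xx)
three_norm4 = float(3 * np.dot(x, x) ** 2)

E = (Id + sId + s2Id) / (n**2 + 2 * n)
moment_x0_4 = float(E[0, 0])                      # entry ((0,0),(0,0))
moment_x0x1 = float(E[0, 1 * n + 1])              # entry ((0,0),(1,1))
print(rank_sId, quad, three_norm4, moment_x0_4, moment_x0x1)
```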

6.4 Construction of Initial Pseudo-Distributions 

We begin by discussing how to create an initial guess at a pseudo-distribution whose 
third moments are highly correlated with the polynomial A(x). This initial guess will 
be a valid pseudo-distribution, but will fail to be on the unit sphere, and so will require 
some repairing later on. For now, the method of creating this initial pseudo-distribution 
involves using a combination of symmetrization techniques to ensure that the matrices we 
construct are well defined as linear functionals over polynomials, and spectral techniques 
to establish positive-semidefiniteness of these matrices. 


6.4.1 Extending Pseudo-Distributions to Degree Four 

In this section we discuss a construction that takes a linear functional $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$
over degree-3 polynomials and yields a degree-4 pseudo-distribution $\{x\}$. We begin by
reminding the reader of the Schur complement criterion for positive-semidefiniteness of
block matrices.


Theorem 6.9. Let $M$ be the following block matrix.

$M \overset{\mathrm{def}}{=} \begin{pmatrix} B & C^T \\ C & D \end{pmatrix}$

where $B \succeq 0$ and is full rank. Then $M \succeq 0$ if and only if $D \succeq C B^{-1} C^T$.
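The Schur complement criterion can be observed on a random instance. This is a minimal sketch assuming NumPy; the block sizes and the margin `0.1` are arbitrary, and `min_eig_of_block` is a helper name introduced here.

```python
import numpy as np

rng = np.random.default_rng(6)
k, m = 3, 4
G = rng.standard_normal((k, k))
B = G @ G.T + np.eye(k)                  # positive definite, hence full rank
C = rng.standard_normal((m, k))
schur = C @ np.linalg.solve(B, C.T)      # C B^{-1} C^T

def min_eig_of_block(D):
    """Smallest eigenvalue of the block matrix [[B, C^T], [C, D]]."""
    M = np.block([[B, C.T], [C, D]])
    return float(np.linalg.eigvalsh(M).min())

above = min_eig_of_block(schur + 0.1 * np.eye(m))   # D - C B^{-1} C^T = +0.1 Id
below = min_eig_of_block(schur - 0.1 * np.eye(m))   # D - C B^{-1} C^T = -0.1 Id
print(above, below)
```

The first block matrix is positive-semidefinite and the second is not, matching the two directions of the criterion.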

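Suppose we are given a linear functional $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$ with $\mathcal{L} 1 = 1$. Let $\mathcal{L}|_1$ be $\mathcal{L}$
restricted to $\mathbb{R}[x]_1$ and similarly for $\mathcal{L}|_2$ and $\mathcal{L}|_3$. We define the following matrices:

— $M_{\mathcal{L}|_1} \in \mathbb{R}^{n \times 1}$ is the matrix representation of $\mathcal{L}|_1$.

— $M_{\mathcal{L}|_2} \in \mathbb{R}^{n \times n}$ is the matrix representation of $\mathcal{L}|_2$.

— $M_{\mathcal{L}|_3} \in \mathbb{R}^{n^2 \times n}$ is the matrix representation of $\mathcal{L}|_3$.

— $V_{\mathcal{L}|_2} \in \mathbb{R}^{n^2 \times 1}$ is the vector flattening of $M_{\mathcal{L}|_2}$.

Consider the block matrix $M \in \mathbb{R}^{\#\mathrm{tuples}(2) \times \#\mathrm{tuples}(2)}$ given by

$M \overset{\mathrm{def}}{=} \begin{pmatrix} 1 & M_{\mathcal{L}|_1}^T & V_{\mathcal{L}|_2}^T \\ M_{\mathcal{L}|_1} & M_{\mathcal{L}|_2} & M_{\mathcal{L}|_3}^T \\ V_{\mathcal{L}|_2} & M_{\mathcal{L}|_3} & D \end{pmatrix}$

with $D \in \mathbb{R}^{n^2 \times n^2}$ yet to be chosen. By taking

$B = \begin{pmatrix} 1 & M_{\mathcal{L}|_1}^T \\ M_{\mathcal{L}|_1} & M_{\mathcal{L}|_2} \end{pmatrix}, \quad C = \begin{pmatrix} V_{\mathcal{L}|_2} & M_{\mathcal{L}|_3} \end{pmatrix}$,

we see by the Schur complement criterion that $M$ is positive-semidefinite so long as
$D \succeq C B^{-1} C^T$. However, not any choice of $D$ will yield $M$ maximally symmetric, which is
necessary for $M$ to define a pseudo-expectation operator $\tilde{\mathbb{E}}$.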

We would ideally take $D$ to be the spectrally-least maximally-symmetric matrix so that
$D \succeq C B^{-1} C^T$. But this object might not be well defined, so we instead take the following
substitute.


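Definition 6.10. Let $\mathcal{L}, B, C$ be as above. The symmetric Schur complement $D \in \mathrm{Sym}\, CB^{-1}C^T$
is $t \sum_{\pi \in S_4} \pi \cdot (CB^{-1}C^T)$ for the least $t$ so that $t \sum_{\pi \in S_4} \pi \cdot (CB^{-1}C^T) \succeq CB^{-1}C^T$. We denote by $\tilde{\mathbb{E}}^{\mathcal{L}}$
the linear functional $\tilde{\mathbb{E}}^{\mathcal{L}} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ whose matrix representation is $M$ with this choice of
$D$, and note that $\tilde{\mathbb{E}}^{\mathcal{L}}$ is a valid degree-4 pseudo-expectation.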

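Example 6.11 (Recovery of Degree-4 Uniform Moments from Symmetric Schur Complement).
Let $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$ be given by $\mathcal{L}\, p(x) := \mathbb{E}^\mu p(x)$. We show that $\tilde{\mathbb{E}}^{\mathcal{L}} = \mathbb{E}^\mu$. In this
case it is straightforward to compute that $CB^{-1}C^T = \sigma \cdot \mathrm{Id}/n^2$. Our task is to pick $t > 0$
minimal so that $\frac{t}{n^2} \Pi(\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id})\Pi \succeq \frac{1}{n^2} \Pi(\sigma \cdot \mathrm{Id})\Pi$.

We know that $\Pi(\sigma \cdot \mathrm{Id})\Pi = \sigma \cdot \mathrm{Id}$. Furthermore, $\Pi\, \mathrm{Id}\, \Pi = \Pi(\sigma^2 \cdot \mathrm{Id})\Pi$, and both are the
identity on the canonically-embedded subspace $\mathbb{R}[x]_2$ in $\mathbb{R}^{\#\mathrm{tuples}(4)}$. We have previously
observed that $\sigma \cdot \mathrm{Id}$ is rank-one and positive-semidefinite, so let $w \in \mathbb{R}^{\#\mathrm{tuples}(4)}$ be such that
$ww^T = \sigma \cdot \mathrm{Id}$.

We compute $w^T(\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id})w = 2\|w\|_2^2 + \|w\|_2^4 = 2n + n^2$ and $w^T(\sigma \cdot \mathrm{Id})w = \|w\|_2^4 = n^2$.
Thus $t = n^2/(n^2 + 2n)$ is the minimizer. By a previous observation, this yields $\mathbb{E}^\mu$.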
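The minimizing value of $t$ in Example 6.11 can be confirmed numerically by looking at the smallest eigenvalue, on the symmetric subspace, of $\Pi\,(t(\mathrm{Id} + \sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id}) - \sigma \cdot \mathrm{Id})\,\Pi$; the common factor $1/n^2$ is dropped since it does not change the sign. The sketch assumes NumPy, restricts to the $n^2 \times n^2$ degree-4 block, and takes $\sigma \cdot \mathrm{Id}$ to be the rank-one matrix $ww^T$ with $w$ the flattened identity and $\sigma^2 \cdot \mathrm{Id}$ the swap matrix; `min_eig` is a helper name introduced here.

```python
import numpy as np

n = 5
Id = np.eye(n * n)
w = np.eye(n).reshape(n * n)
sId = np.outer(w, w)
s2Id = Id.reshape(n, n, n, n).transpose(0, 1, 3, 2).reshape(n * n, n * n)
Pi = 0.5 * (Id + s2Id)

# Orthonormal basis of the range of Pi: eigenvectors with eigenvalue 1 (eigh sorts ascending).
dim_sym = n * (n + 1) // 2
basis = np.linalg.eigh(Pi)[1][:, -dim_sym:]

def min_eig(t):
    """Smallest eigenvalue of Pi (t (Id + sId + s2Id) - sId) Pi on the symmetric subspace."""
    gap = Pi @ (t * (Id + sId + s2Id) - sId) @ Pi
    return float(np.linalg.eigvalsh(basis.T @ gap @ basis).min())

t_star = n**2 / (n**2 + 2 * n)
at_star = min_eig(t_star)
just_below = min_eig(0.99 * t_star)
just_above = min_eig(1.01 * t_star)
print(t_star, at_star, just_below, just_above)
```

At $t = n^2/(n^2+2n)$ the smallest eigenvalue is zero up to rounding, it is negative for slightly smaller $t$, and positive for slightly larger $t$.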

To prove our lower bound, we will generalize the above example to the case that we
start with an operator $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$ which does not match $\mathbb{E}^\mu$ on degree-3 polynomials.


6.4.2 Symmetries at Degree Three 

We intend on using the symmetric Schur complement to construct a pseudo-distribution
from some $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$ for which $\mathcal{L} A(x)$ is large. A good such $\mathcal{L}$ will have $\mathcal{L}\, x_i x_j x_k$
correlated with $\sum_{\pi \in S_3} A^\pi_{ijk}$ for all (or many) indices $i, j, k$. That is, it should be correlated
with the coefficient of the monomial $x_i x_j x_k$ in $A(x)$. However, if we do this directly by
setting $\mathcal{L}\, x_i x_j x_k = \sum_{\pi} A^\pi_{ijk}$, it becomes technically inconvenient to control the spectrum of
the resulting symmetric Schur complement. To avoid this, we discuss how to utilize a
decomposition of $M_{\mathcal{L}|_3}$ into nicer matrices if such a decomposition exists.

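Lemma 6.12. Let $\mathcal{L} : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$, and suppose that $M_{\mathcal{L}|_3} = \frac{1}{k}(M^1_{\mathcal{L}|_3} + \cdots + M^k_{\mathcal{L}|_3})$ for some
$M^1_{\mathcal{L}|_3}, \ldots, M^k_{\mathcal{L}|_3} \in \mathbb{R}^{n^2 \times n}$. Let $D_1, \ldots, D_k$ be the respective symmetric Schur complements of the
family of matrices

$\left\{ \begin{pmatrix} 1 & M_{\mathcal{L}|_1}^T & V_{\mathcal{L}|_2}^T \\ M_{\mathcal{L}|_1} & M_{\mathcal{L}|_2} & (M^i_{\mathcal{L}|_3})^T \\ V_{\mathcal{L}|_2} & M^i_{\mathcal{L}|_3} & \bullet \end{pmatrix} \right\}_{1 \le i \le k}$.

Then the matrix

$M \overset{\mathrm{def}}{=} \frac{1}{k} \sum_{i=1}^k \begin{pmatrix} 1 & M_{\mathcal{L}|_1}^T & V_{\mathcal{L}|_2}^T \\ M_{\mathcal{L}|_1} & M_{\mathcal{L}|_2} & (M^i_{\mathcal{L}|_3})^T \\ V_{\mathcal{L}|_2} & M^i_{\mathcal{L}|_3} & D_i \end{pmatrix}$

is positive-semidefinite and maximally symmetric. Therefore it defines a valid pseudo-expectation
$\tilde{\mathbb{E}}^{\mathcal{L}}$. (This is a slight abuse of notation, since the pseudo-expectation defined here in general differs
from the one in Definition 6.10.)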


Proof. Each matrix in the sum defining $M$ is positive-semidefinite, so $M \succeq 0$. Each $D_i$
is maximally symmetric and therefore so is $\sum_{i=1}^k D_i$. We know that $M_{\mathcal{L}|_3} = \frac{1}{k} \sum_{i=1}^k M^i_{\mathcal{L}|_3}$
is maximally-symmetric, so it follows that $M$ is the matrix representation of a valid
pseudo-expectation. □


6.5 Getting to the Unit Sphere 

Our next tool takes a pseudo-distribution $\tilde{\mathbb{E}}$ that is slightly off the unit sphere, and corrects
it to give a linear functional $\mathcal{L} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ that lies on the unit sphere.

We will also characterize how badly the resulting linear functional deviates from the
nonnegativity condition ($\mathcal{L}\, p(x)^2 \ge 0$ for $p \in \mathbb{R}[x]_{\le 2}$) required to be a pseudo-distribution.

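Definition 6.13. Let $\mathcal{L} : \mathbb{R}[x]_{\le d} \to \mathbb{R}$. We define

$\lambda_{\min} \mathcal{L} \overset{\mathrm{def}}{=} \min_{p \in \mathbb{R}[x]_{\le d/2}} \frac{\mathcal{L}\, p(x)^2}{\mathbb{E}^\mu p(x)^2}$

where $\mathbb{E}^\mu p(x)^2$ is the expectation of $p(x)^2$ when $x$ is distributed according to the uniform
distribution on the unit sphere.

Since $\mathbb{E}^\mu p(x)^2 \ge 0$ for all $p$, we have $\mathcal{L}\, p(x)^2 \ge 0$ for all $p$ if and only if $\lambda_{\min} \mathcal{L} \ge 0$. Thus
$\mathcal{L}$ on the unit sphere is a pseudo-distribution if and only if $\mathcal{L} 1 = 1$ and $\lambda_{\min} \mathcal{L} \ge 0$.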

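Lemma 6.14. Let $\tilde{\mathbb{E}} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ be a valid pseudodistribution. Suppose that:

1. $c := \tilde{\mathbb{E}}\, \|x\|_2^4 \ge 1$.

2. $\tilde{\mathbb{E}}$ is close to lying on the sphere, in the sense that there are $\delta_1, \delta_2, \delta_2' \ge 0$ so that:

(a) $\left| \tfrac{1}{c} \tilde{\mathbb{E}}\, \|x\|_2^2 x_i - \mathcal{L}' x_i \right| \le \delta_1$ for all $i$.

(b) $\left| \tfrac{1}{c} \tilde{\mathbb{E}}\, \|x\|_2^2 x_i x_j - \mathcal{L}' x_i x_j \right| \le \delta_2$ for all $i \ne j$.

(c) $\left| \tfrac{1}{c} \tilde{\mathbb{E}}\, \|x\|_2^2 x_i^2 - \mathcal{L}' x_i^2 \right| \le \delta_2'$ for all $i$.

Let $\mathcal{L} : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ be as follows on homogeneous $p$:

$\mathcal{L}\, p(x) \overset{\mathrm{def}}{=} \begin{cases} \tilde{\mathbb{E}}\, 1 & \text{if } \deg p = 0 \\ \tfrac{1}{c} \tilde{\mathbb{E}}\, p(x) & \text{if } \deg p = 3, 4 \\ \tfrac{1}{c} \tilde{\mathbb{E}}\, p(x) \|x\|_2^2 & \text{if } \deg p = 1, 2. \end{cases}$

Then $\mathcal{L}$ satisfies $\mathcal{L}\, p(x)(\|x\|_2^2 - 1) = 0$ for all $p(x) \in \mathbb{R}[x]_{\le 2}$ and has
$\lambda_{\min} \mathcal{L} \ge -\frac{c-1}{c} - O(n)\delta_1 - O(n^{3/2})\delta_2' - O(n^2)\delta_2$.

Proof. It is easy to check that $\mathcal{L}\, p(x)(\|x\|_2^2 - 1) = 0$ for all $p \in \mathbb{R}[x]_{\le 2}$ by expanding the
definition of $\mathcal{L}$.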

Let the linear functional £! : 1R[x]^ 4 —> 1R be defined over homogeneous polynomials p 


as 


[c 


£'p(x) d = 


E p(x) 


[Ep(x)\\x\\l 


if deg p = 0 
if deg p = 3,4 
if degp = 1,2. 


Note that p(x) = c X p(x) for all p e ]R[x]^ 4 . Thus A m in X > A m in £ / c , and the kernel of 
If is identical to the kernel of X. 

In particular, since (||x||y - 1) is in the kernel of If, either A m m If = 0 or 


Amin -C 


peR[x] 


SSwg-uE tpw 


Here p _L (||x || 2 - 1) means that the polynomials p and 11 x 11 2 - 1 are perpendicular in the 
coefficient basis. That is, if p(x) = p 0 + E; P> x i + E/; PijXjXj, this means E,, p n = po- The 
equality holds because any linear functional on polynomials ‘K with (11x11 2 - 1) in its kernel 
satisfies r K(p(x) + n(||x|| 2 -1)) 2 = 'Kp(x) 2 for every a. The functionals If and E f ' in particular 
both satisfy this. 

Let A := If - E, and note that A is nonzero only when evaluated on the degree-1 or -2 
parts of polynomials. It will be sufficient to bound A, since assuming A m i n If I 0, 


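\[
\lambda_{\min} \mathcal L'
= \min_{p \in \mathbb{R}[x]_{\le 2},\; p \perp (\|x\|_2^2 - 1)} \frac{\Delta p(x)^2 + \tilde{\mathbb E}^\mu p(x)^2}{\tilde{\mathbb E}^\mu p(x)^2}
= 1 + \min_{p \in \mathbb{R}[x]_{\le 2},\; p \perp (\|x\|_2^2 - 1)} \frac{\Delta p(x)^2}{\tilde{\mathbb E}^\mu p(x)^2}.
\]

Let $p \in \mathbb{R}[x]_{\le 2}$. We expand $p$ in the monomial basis: $p(x) = p_0 + \sum_i p_i x_i + \sum_{i,j} p_{ij} x_i x_j$. Then

\[
p(x)^2 = p_0^2 + 2p_0 \sum_i p_i x_i + 2p_0 \sum_{ij} p_{ij} x_i x_j + \Big(\sum_i p_i x_i\Big)^2 + 2\Big(\sum_i p_i x_i\Big)\Big(\sum_{ij} p_{ij} x_i x_j\Big) + \Big(\sum_{ij} p_{ij} x_i x_j\Big)^2.
\]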


An easy calculation gives 


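\[
\tilde{\mathbb E}^\mu p(x)^2 = p_0^2 + \frac{2p_0}{n}\sum_i p_{ii} + \frac{1}{n}\sum_i p_i^2 + \frac{1}{n^2 + 2n}\Big(\Big(\sum_i p_{ii}\Big)^2 + \sum_{ij} p_{ij}^2 + \sum_{ij} p_{ij} p_{ji}\Big).
\]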

The condition $p \perp (\|x\|_2^2 - 1)$ yields $p_0 = \sum_i p_{ii}$. Substituting into the above, we obtain the sum of squares

Without loss of generality we assume $\tilde{\mathbb E}^\mu p(x)^2 = 1$, so now it is enough just to bound $\Delta p(x)^2$. We have assumed that $|\Delta x_i| \le c\delta_1$ and $|\Delta x_i x_j| \le c\delta_2$ for $i \ne j$ and $|\Delta x_i^2| \le c\delta_2'$. We also know $\Delta 1 = c - 1$ and $\Delta p(x) = 0$ when $p$ is a homogeneous degree-3 or -4 polynomial. So we expand

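\[
\Delta p(x)^2 = p_0^2 (c - 1) + 2p_0 \sum_i p_i \Delta x_i + 2p_0 \sum_{ij} p_{ij} \Delta x_i x_j + \sum_{i,j} p_i p_j \Delta x_i x_j
\]

and note that this is maximized in absolute value when all the signs line up:

\[
|\Delta p(x)^2| \le p_0^2 (c - 1) + 2c\delta_1 |p_0| \sum_i |p_i| + 2|p_0| \Big( c\delta_2 \sum_{i \ne j} |p_{ij}| + c\delta_2' \sum_i |p_{ii}| \Big) + c\delta_2 \Big(\sum_i |p_i|\Big)^2 + c\delta_2' \sum_i p_i^2.
\]

We start with the second term. If $p_0^2 = \alpha$ for $\alpha \in [0,1]$, then $\sum_i p_i^2 \le n(1 - \alpha)$ by our assumption that $\tilde{\mathbb E}^\mu p(x)^2 = 1$. This means that

\[
2c\delta_1 |p_0| \sum_i |p_i| \le 2c\delta_1 \sqrt{\alpha n \sum_i p_i^2} \le 2c\delta_1 n \sqrt{\alpha(1 - \alpha)} \le O(n) c\delta_1,
\]

where we have used Cauchy-Schwarz and the fact $\max_{0 \le \alpha \le 1} \alpha(1 - \alpha) = (1/2)^2$. The other terms are all similar:

\begin{align*}
p_0^2 (c - 1) &\le c - 1 \\
2|p_0| c\delta_2 \sum_{i \ne j} |p_{ij}| &\le 2c\delta_2 \sqrt{\alpha n^2 \sum_{ij} p_{ij}^2} \le 2c\delta_2\, O(n^2) \sqrt{\alpha(1 - \alpha)} \le O(n^2) c\delta_2 \\
2|p_0| c\delta_2' \sum_i |p_{ii}| &\le 2c\delta_2' \sqrt{\alpha n \sum_i p_{ii}^2} \le O(n^{3/2}) c\delta_2' \\
c\delta_2 \Big(\sum_i |p_i|\Big)^2 &\le c\delta_2\, n \sum_i p_i^2 \le O(n^2) c\delta_2 \\
c\delta_2' \sum_i p_i^2 &\le O(n) c\delta_2',
\end{align*}

where in each case we have used Cauchy-Schwarz and our assumption $\tilde{\mathbb E}^\mu p(x)^2 = 1$. Putting it all together, we get

\[
\lambda_{\min} \Delta \ge -(c - 1) - O(n) c\delta_1 - O(n^{3/2}) c\delta_2' - O(n^2) c\delta_2. \qquad \Box
\]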
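
The proof above relies on the low-degree moments of the uniform distribution on the unit sphere, namely $\tilde{\mathbb E}^\mu x_i^2 = 1/n$, $\tilde{\mathbb E}^\mu x_i^4 = 3/(n^2+2n)$ and $\tilde{\mathbb E}^\mu x_i^2 x_j^2 = 1/(n^2+2n)$ for $i \ne j$. As an illustrative sanity check (not part of the original argument), the following sketch estimates these moments by Monte Carlo sampling. The function names, the dimension `n = 4` and the sample count are arbitrary choices made for this example.

```python
import math
import random

def sample_unit_sphere(n, rng):
    """Draw a uniform point on the unit sphere in R^n by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def empirical_moments(n, samples, seed=0):
    """Estimate E x_1^2, E x_1^4 and E x_1^2 x_2^2 under the uniform sphere measure."""
    rng = random.Random(seed)
    m2 = m4 = m22 = 0.0
    for _ in range(samples):
        x = sample_unit_sphere(n, rng)
        m2 += x[0] ** 2
        m4 += x[0] ** 4
        m22 += x[0] ** 2 * x[1] ** 2
    return m2 / samples, m4 / samples, m22 / samples

n = 4
m2, m4, m22 = empirical_moments(n, samples=50_000)
# Closed-form values used in the proof: 1/n, 3/(n^2+2n), 1/(n^2+2n).
exact = (1 / n, 3 / (n * n + 2 * n), 1 / (n * n + 2 * n))
```

The estimates should agree with `exact` up to sampling error, which shrinks like the inverse square root of the number of samples.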

6.6 Repairing Almost-Pseudo-Distributions 

Our last tool takes a linear functional $\mathcal L : \mathbb{R}[x]_{\le d} \to \mathbb{R}$ that is "almost" a pseudo-distribution over the unit sphere, in the precise sense that all conditions for being a pseudo-distribution over the sphere are satisfied except that $\lambda_{\min} \mathcal L = -\varepsilon$. The tool transforms it into a bona fide pseudo-distribution at a slight cost to its evaluations at various polynomials.

Lemma 6.15. Let $\mathcal L : \mathbb{R}[x]_{\le d} \to \mathbb{R}$ and suppose that

• $\mathcal L\, 1 = 1$

• $\mathcal L\, p(x)(\|x\|^2 - 1) = 0$ for all $p \in \mathbb{R}[x]_{\le d-2}$.

• $\lambda_{\min} \mathcal L \ge -\varepsilon$.

Then the operator $\tilde{\mathbb E} : \mathbb{R}[x]_{\le d} \to \mathbb{R}$ given by

\[
\tilde{\mathbb E}\, p(x) \overset{\mathrm{def}}{=} \frac{1}{1 + \varepsilon}\big(\mathcal L\, p(x) + \varepsilon\, \tilde{\mathbb E}^\mu p(x)\big)
\]

is a valid pseudo-expectation satisfying $\{\|x\|^2 = 1\}$.

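Proof. It will suffice to check that $\lambda_{\min} \tilde{\mathbb E} \ge 0$ and that $\tilde{\mathbb E}$ has $\tilde{\mathbb E}(\|x\|^2 - 1)^2 = 0$ and $\tilde{\mathbb E}\, 1 = 1$. For the first, let $p \in \mathbb{R}[x]_{\le 2}$. We have

\[
\frac{\tilde{\mathbb E}\, p(x)^2}{\tilde{\mathbb E}^\mu p(x)^2}
= \frac{1}{1 + \varepsilon} \left( \frac{\mathcal L\, p(x)^2 + \varepsilon\, \tilde{\mathbb E}^\mu p(x)^2}{\tilde{\mathbb E}^\mu p(x)^2} \right)
\ge \frac{1}{1 + \varepsilon}(-\varepsilon + \varepsilon) \ge 0.
\]

Hence, $\lambda_{\min} \tilde{\mathbb E} \ge 0$.

It is straightforward to check the conditions that $\tilde{\mathbb E}\, 1 = 1$ and that $\tilde{\mathbb E}$ satisfies $\{\|x\|^2 - 1 = 0\}$, since $\tilde{\mathbb E}$ is a convex combination of linear functionals that already satisfy these linear constraints. □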
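
The repair step of Lemma 6.15 is a convex combination that shifts the spectrum upward by exactly the deficit $\varepsilon$. The following toy sketch illustrates the arithmetic on a symmetric $2 \times 2$ matrix. It rests on a simplifying assumption made only for this example: the reference functional $\tilde{\mathbb E}^\mu$ is modeled by the identity matrix, so that $\lambda_{\min}$ becomes the ordinary smallest eigenvalue. The helper names are hypothetical.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]] in ascending order."""
    mean = (a + d) / 2
    rad = math.hypot((a - d) / 2, b)
    return mean - rad, mean + rad

def repair(m, eps):
    """Return (m + eps * I) / (1 + eps), the analogue of (L + eps * E^mu) / (1 + eps)."""
    a, b, d = m
    return ((a + eps) / (1 + eps), b / (1 + eps), (d + eps) / (1 + eps))

m = (1.0, 0.6, 0.2)              # symmetric 2x2 matrix stored as (a, b, d); not PSD
lam_min, _ = eig_sym_2x2(*m)     # negative smallest eigenvalue
eps = max(0.0, -lam_min)         # the deficit to be repaired
repaired = repair(m, eps)
repaired_min, _ = eig_sym_2x2(*repaired)
```

After the repair the smallest eigenvalue is $(\lambda_{\min} + \varepsilon)/(1+\varepsilon)$, which is zero up to rounding when $\varepsilon = -\lambda_{\min}$.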


6.7 Putting Everything Together 

We are ready to prove Theorem 6.3 and Theorem 6.4. The proof of Theorem 6.3 is somewhat 
simpler and contains many of the ideas of the proof of Theorem 6.4, so we start there. 

6.7.1 The Degree-4 Lower Bound 

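Proof of Theorem 6.3. We begin by constructing a degree-4 pseudo-expectation $\tilde{\mathbb E}^0 : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ whose degree-4 moments are biased towards $A(x)$ but which does not yet satisfy $\{\|x\|_2^2 - 1 = 0\}$.

Let $\mathcal L : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ be the functional whose matrix representation when restricted to $\mathcal L|_4 : \mathbb{R}[x]_4 \to \mathbb{R}$ is given by $M_{\mathcal L|_4} = \frac{1}{|S_4| n^2} \sum_{\pi \in S_4} A^\pi$, and which is 0 on polynomials of degree at most 3.

Let $\tilde{\mathbb E}^0 := \tilde{\mathbb E}^\mu + \varepsilon \mathcal L$, where $\varepsilon$ is a parameter to be chosen soon so that $\tilde{\mathbb E}^0 p(x)^2 \ge 0$ for all $p \in \mathbb{R}[x]_{\le 2}$. Let $p \in \mathbb{R}[x]_{\le 2}$. We expand $p$ in the monomial basis as $p(x) = p_0 + \sum_i p_i x_i + \sum_{ij} p_{ij} x_i x_j$.
Then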

h 

By our assumption on negative eigenvalues of $A^\pi$ for all $\pi \in S_4$, we know that $\mathcal L\, p(x)^2 \ge -\frac{\lambda^2}{n^2} \sum_{ij} p_{ij}^2$. So if we choose $\varepsilon \le 1/\lambda^2$, the operator $\tilde{\mathbb E}^0 = \tilde{\mathbb E}^\mu + \mathcal L/\lambda^2$ will be a valid pseudo-expectation. Moreover $\tilde{\mathbb E}^0$ is well correlated with $A$, since it was obtained by maximizing the amount of $\mathcal L$, which is simply the (maximally-symmetric) dual of $A$. However the calculation of $\tilde{\mathbb E}^0 \|x\|_2^4$ shows that this pseudo-expectation is not on the unit sphere, though it is close. Let $c$ refer to

\[
c := \tilde{\mathbb E}^0 \|x\|_2^4 = \tilde{\mathbb E}^\mu \|x\|_2^4 + \frac{1}{\lambda^2} \mathcal L \|x\|_2^4 = 1 + \frac{1}{|S_4| n^2 \lambda^2} \sum_{\pi \in S_4} \langle \mathrm{Id}^{\mathrm{sym}}, A^\pi \rangle = 1 + O(n^{-1/2}).
\]


TCESa 


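We would like to use Lemma 6.14 together with $\tilde{\mathbb E}^0$ to obtain some $\mathcal L^1 : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ with $\|x\|_2^2 - 1$ in its kernel and bounded $\lambda_{\min} \mathcal L^1$ while still maintaining a high correlation with $A$. For this we need $\xi_1, \xi_2, \xi_2'$ so that

\begin{align*}
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i - \tilde{\mathbb E}^\mu x_i \right| &\le \xi_1 \quad \text{for all } i, \\
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i x_j - \tilde{\mathbb E}^\mu x_i x_j \right| &\le \xi_2 \quad \text{for all } i \ne j, \\
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i^2 - \tilde{\mathbb E}^\mu x_i^2 \right| &\le \xi_2' \quad \text{for all } i.
\end{align*}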


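Since $\tilde{\mathbb E}^0 p(x) = 0$ for all homogeneous odd-degree $p$, we may take $\xi_1 = 0$. For $\xi_2$, we have that when $i \ne j$,

\[
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i x_j - \tilde{\mathbb E}^\mu x_i x_j \right| = \frac{1}{c\lambda^2} \left| \mathcal L \|x\|_2^2 x_i x_j \right| \le \delta_2,
\]

where we recall $\delta_2$ and $\delta_2'$ defined in the theorem statement. Finally, for $\xi_2'$, we have

\[
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i^2 - \tilde{\mathbb E}^\mu x_i^2 \right| \le \frac{1}{c\lambda^2} \left| \mathcal L \|x\|_2^2 x_i^2 \right| + \left| \tfrac{1}{c} \tilde{\mathbb E}^\mu \|x\|_2^2 x_i^2 - \tilde{\mathbb E}^\mu x_i^2 \right| \le \delta_2' + \frac{c-1}{cn}.
\]

Thus, Lemma 6.14 yields $\mathcal L^1 : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ with $\|x\|_2^2 - 1$ in its kernel in the sense that $\mathcal L^1 p(x)(\|x\|_2^2 - 1) = 0$ for all $p \in \mathbb{R}[x]_{\le 2}$. If we take $\xi_2 = \delta_2$ and $\xi_2' = \delta_2' + \frac{c-1}{cn}$, then $\lambda_{\min} \mathcal L^1 \ge -\frac{c-1}{c} - n^2 \delta_2 - n^{3/2}\big(\delta_2' + \frac{c-1}{cn}\big) = -O(1)$. Furthermore, $\mathcal L^1 A(x) = \frac{1}{c\lambda^2} \mathcal L A(x) = \Theta\big(\frac{1}{\lambda^2} \mathcal L A(x)\big)$.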

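So by Lemma 6.15, there is a degree-4 pseudo-expectation $\tilde{\mathbb E}$ satisfying $\{\|x\|_2^2 = 1\}$ so that

\begin{align*}
\tilde{\mathbb E} A(x) &= \Theta\Big( \frac{1}{\lambda^2} \mathcal L A(x) \Big) + \Theta(\tilde{\mathbb E}^\mu A(x)) \\
&= \Theta\Big( \frac{1}{|S_4| n^2 \lambda^2} \Big\langle A, \sum_{\pi \in S_4} A^\pi \Big\rangle \Big) + \Theta(\tilde{\mathbb E}^\mu A(x)) \\
&\ge \Omega\Big( \frac{n^2}{\lambda^2} \Big) + \Theta(\tilde{\mathbb E}^\mu A(x)). \qquad \Box
\end{align*}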


6.7.2 The Degree-3 Lower Bound 

Now we turn to the proof of Theorem 6.4. 

Proof of Theorem 6.4. Let $A$ be a 3-tensor. Let $\varepsilon > 0$ be a parameter to be chosen later. We begin with the following linear functional $\mathcal L : \mathbb{R}[x]_{\le 3} \to \mathbb{R}$. For any monomial $x^\alpha$ (where $\alpha$ is a multi-index of degree at most 3),

\[
\mathcal L\, x^\alpha \overset{\mathrm{def}}{=}
\begin{cases}
\tilde{\mathbb E}^\mu x^\alpha & \text{if } \deg x^\alpha \le 2 \\
\frac{\varepsilon}{n^{3/2}} \sum_{\pi \in S_3} A^\pi_\alpha & \text{if } \deg x^\alpha = 3.
\end{cases}
\]

The functional $\mathcal L$ contains our current best guess at the degree 1 and 2 moments of a pseudo-distribution whose degree-3 moments are $\varepsilon$-correlated with $A(x)$.

The next step is to use symmetric Schur complement to extend $\mathcal L$ to a degree-4 pseudo-expectation. Note that $M_{\mathcal L|_3}$ decomposes as

\[
M_{\mathcal L|_3} = \frac{\varepsilon}{n^{3/2}} \sum_{\pi \in S_3} \Pi A^\pi,
\]

where, as a reminder, $A^\pi$ is the $n^2 \times n$ flattening of $A^\pi$ and $\Pi$ is the projector to the canonical embedding of $\mathbb{R}[x]_2$ into $\mathbb{R}^{n^2}$. So, using Lemma 6.12, we want to find the symmetric Schur complements of the following family of matrices (with notation matching the statement of Lemma 6.12):


1 v k 

■ M a M £k 3i(rM") T 

T £ rr An 


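Since we have the same assumptions on $A^\pi$ for all $\pi \in S_3$, without loss of generality we analyze just the case that $\pi$ is the identity permutation, in which case $A^\pi = A$.

Since $\mathcal L$ matches the degree-one and degree-two moments of the uniform distribution on the unit sphere, we have $M_{\mathcal L|_1} = 0$, the $n$-dimensional zero vector, and $M_{\mathcal L|_2} = \frac{1}{n} \mathrm{Id}_{n \times n}$. Let $w \in \mathbb{R}^{n^2}$ be the $n^2$-dimensional vector flattening of $\mathrm{Id}_{n \times n}$. We observe that $ww^T = \sigma \cdot \mathrm{Id}$ is one of the permutations of $\mathrm{Id}_{n^2 \times n^2}$. Taking $B$ and $C$ as follows,

\[
B = \begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{n} \mathrm{Id}_{n \times n} \end{pmatrix},
\qquad
C = \begin{pmatrix} \frac{1}{n} w & \frac{\varepsilon}{n^{3/2}} \Pi A \end{pmatrix},
\]

we compute that

\[
C B^{-1} C^T = \frac{1}{n^2} (\sigma \cdot \mathrm{Id}) + \frac{\varepsilon^2}{n^2} \Pi A A^T \Pi.
\]

Symmetrizing the $\mathrm{Id}$ portion and the $AA^T$ portion of this matrix separately, we see that the symmetric Schur complement that we are looking for is the spectrally-least $M \in \mathrm{Sym}\big(\frac{1}{n^2}(\sigma \cdot \mathrm{Id}) + \frac{\varepsilon^2}{n^2} \Pi AA^T \Pi\big)$ so that

\begin{align*}
M &= \frac{t}{n^2}\, 3\, \mathrm{Id}^{\mathrm{sym}} + \frac{\varepsilon^2}{2n^2} \Big( \Pi (AA^T + \sigma \cdot AA^T + \sigma^2 \cdot AA^T) \Pi + \Pi (AA^T + \sigma \cdot AA^T + \sigma^2 \cdot AA^T)^T \Pi \Big) \\
&\succeq \frac{1}{n^2} (\sigma \cdot \mathrm{Id}) + \frac{\varepsilon^2}{n^2} \Pi AA^T \Pi.
\end{align*}

Here we have used Corollary 6.7 and Corollary 6.6 to express a general element of $\mathrm{Sym}\big(\frac{\varepsilon^2}{n^2} \Pi AA^T \Pi\big)$ in terms of $\Pi$, $AA^T$, $\sigma \cdot AA^T$, and $\sigma^2 \cdot AA^T$.

Any spectrally small $M$ satisfying the above suffices for us. Taking $t = 1$, canceling some terms, and making the substitution $3\, \mathrm{Id}^{\mathrm{sym}} - \sigma \cdot \mathrm{Id} = 2\, \Pi\, \mathrm{Id}\, \Pi$, we see that it is enough to have

\[
-2\, \Pi\, \mathrm{Id}\, \Pi \preceq \frac{\varepsilon^2}{2} \Pi \Big( \sigma \cdot AA^T + \sigma^2 \cdot AA^T + (\sigma \cdot AA^T)^T + (\sigma^2 \cdot AA^T)^T \Big) \Pi,
\]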
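which by the premises of the theorem holds for $\varepsilon = 1/\lambda$. Pushing through our symmetrized Schur complement rule with our decomposition of $M_{\mathcal L|_3}$ (Lemma 6.12), this $\varepsilon$ yields a valid degree-4 pseudo-expectation $\tilde{\mathbb E}^0 : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$. From our choice of parameters, we see that $\tilde{\mathbb E}^0|_4$, the degree-4 part of $\tilde{\mathbb E}^0$, is given by $\tilde{\mathbb E}^0|_4 = \big(1 + \frac{1}{n}\big) \tilde{\mathbb E}^\mu|_4 + \mathcal L$, where $\mathcal L : \mathbb{R}[x]_4 \to \mathbb{R}$ is as defined in the theorem statement. Furthermore, $\tilde{\mathbb E}^0 p(x) = \tilde{\mathbb E}^\mu p(x)$ for $p$ with $\deg p \le 2$.

We would like to know how big $\tilde{\mathbb E}^0 \|x\|_2^4$ is. We have

\[
c := \tilde{\mathbb E}^0 \|x\|_2^4 = \Big(1 + \frac{1}{n}\Big) \tilde{\mathbb E}^\mu \|x\|_2^4 + \mathcal L \|x\|_2^4 = 1 + \frac{1}{n} + \mathcal L \|x\|_2^4.
\]

We have assumed that $\langle \mathrm{Id}^{\mathrm{sym}}, AA^T \rangle \le O(\lambda^2 n^2)$. Since $\mathrm{Id}^{\mathrm{sym}}$ is maximally symmetric, we have $\langle \mathrm{Id}^{\mathrm{sym}}, \sum_{\pi \in S_4} \pi \cdot AA^T \rangle = \langle \mathrm{Id}^{\mathrm{sym}}, |S_4| AA^T \rangle$ and so

\[
\mathcal L \|x\|_2^4 = \frac{1}{\lambda^2 n^2} \langle \mathrm{Id}^{\mathrm{sym}}, AA^T \rangle \le O(1).
\]

Finally, our assumptions on $\langle A, \sum_{\pi \in S_3} A^\pi \rangle$ yield

\[
\tilde{\mathbb E}^0 A(x) = \frac{\varepsilon}{n^{3/2}} \Big\langle A, \sum_{\pi \in S_3} A^\pi \Big\rangle \ge \Omega\Big( \frac{n^{3/2}}{\lambda} \Big).
\]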

We have established the following lemma. 

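Lemma 6.16. Under the assumptions of Theorem 6.4 there is a degree-4 pseudo-expectation operator $\tilde{\mathbb E}^0$ so that

• $c := \tilde{\mathbb E}^0 \|x\|_2^4 = 1 + O(1)$.

• $\tilde{\mathbb E}^0 A(x) \ge \Omega(n^{3/2}/\lambda)$.

• $\tilde{\mathbb E}^0 p(x) = \tilde{\mathbb E}^\mu p(x)$ for all $p \in \mathbb{R}[x]_{\le 2}$.

• $\tilde{\mathbb E}^0|_4 = \big(1 + \frac{1}{n}\big) \tilde{\mathbb E}^\mu|_4 + \mathcal L$. □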

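Now we would like to feed $\tilde{\mathbb E}^0$ into Lemma 6.14 to get a linear functional $\mathcal L^1 : \mathbb{R}[x]_{\le 4} \to \mathbb{R}$ with $\|x\|_2^2 - 1$ in its kernel (equivalently, which satisfies $\{\|x\|_2^2 - 1 = 0\}$), but in order to do that we need to find $\xi_1, \xi_2, \xi_2'$ so that

\begin{align*}
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i - \tilde{\mathbb E}^\mu x_i \right| &\le \xi_1 \quad \text{for all } i, \\
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i x_j - \tilde{\mathbb E}^\mu x_i x_j \right| &\le \xi_2 \quad \text{for all } i \ne j, \\
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i^2 - \tilde{\mathbb E}^\mu x_i^2 \right| &\le \xi_2' \quad \text{for all } i.
\end{align*}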


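For $\xi_1$, we note that for every $i$, $\tilde{\mathbb E}^0 x_i = 0$ since $\tilde{\mathbb E}^0$ matches the uniform distribution on degree one and two polynomials. Thus,

\[
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i - \tilde{\mathbb E}^\mu x_i \right| = \left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i \right|.
\]

We know that $M_{\tilde{\mathbb E}^0|_3}$, the matrix representation of the degree-3 part of $\tilde{\mathbb E}^0$, is $\frac{1}{|S_3| n^{3/2} \lambda} \sum_{\pi \in S_3} A^\pi$. Expanding $\tilde{\mathbb E}^0 \|x\|_2^2 x_i$ with matrix representations, we get

\[
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i \right| = \frac{1}{|S_3| c\, n^{3/2} \lambda} \Big| \Big\langle \mathrm{Id}_{n \times n}, \sum_{\pi \in S_3} A^\pi_i \Big\rangle \Big| \le \delta_1,
\]

where $\delta_1$ is as defined in the theorem statement.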

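Now for $\xi_2$ and $\xi_2'$. Let $\mathcal L$ be the operator in the theorem statement. By the definition of $\tilde{\mathbb E}^0$, we get

\[
\tilde{\mathbb E}^0|_4 = \Big(1 + \frac{1}{n}\Big) \tilde{\mathbb E}^\mu|_4 + \mathcal L.
\]

In particular, for $i \ne j$,

\[
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i x_j - \tilde{\mathbb E}^\mu x_i x_j \right| = \tfrac{1}{c} \left| \mathcal L \|x\|_2^2 x_i x_j \right| \le \delta_2.
\]

For $i = j$,

\begin{align*}
\left| \tfrac{1}{c} \tilde{\mathbb E}^0 \|x\|_2^2 x_i^2 - \tilde{\mathbb E}^\mu x_i^2 \right|
&= \tfrac{1}{c} \left| \mathcal L \|x\|_2^2 x_i^2 + \Big(1 + \tfrac{1}{n}\Big) \tilde{\mathbb E}^\mu \|x\|_2^2 x_i^2 - c\, \tilde{\mathbb E}^\mu x_i^2 \right| \\
&= \tfrac{1}{c} \left| \mathcal L \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal L \|x\|_2^4 + \tfrac{1}{n} \mathcal L \|x\|_2^4 + \tfrac{1}{n} \Big(1 + \tfrac{1}{n}\Big) - c\, \tilde{\mathbb E}^\mu x_i^2 \right| \\
&= \tfrac{1}{c} \left| \mathcal L \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal L \|x\|_2^4 + \tfrac{1}{n} \tilde{\mathbb E}^0 \|x\|_2^4 - c\, \tilde{\mathbb E}^\mu x_i^2 \right| \\
&= \tfrac{1}{c} \left| \mathcal L \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal L \|x\|_2^4 \right| \le \delta_2'.
\end{align*}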


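Thus, we can take $\xi_1 = \delta_1$, $\xi_2 = \delta_2$, $\xi_2' = \delta_2'$, and $c = \tilde{\mathbb E}^0 \|x\|_2^4 = 1 + O(1)$, and apply Lemma 6.14 to conclude that

\[
\lambda_{\min} \mathcal L^1 \ge -\frac{c-1}{c} - O(n) \xi_1 - O(n^{3/2}) \xi_2' - O(n^2) \xi_2 = -O(1).
\]

The functional $\mathcal L^1$ loses a constant factor in the value assigned to $A(x)$ as compared to $\tilde{\mathbb E}^0$:

\[
\mathcal L^1 A(x) = \frac{\tilde{\mathbb E}^0 A(x)}{c} \ge \Omega\Big( \frac{n^{3/2}}{\lambda} \Big).
\]

Now using Lemma 6.15, we can correct the negative eigenvalue of $\mathcal L^1$ to get a pseudo-expectation

\[
\tilde{\mathbb E} \overset{\mathrm{def}}{=} \Theta(1) \mathcal L^1 + \Theta(1) \tilde{\mathbb E}^\mu.
\]

By Lemma 6.15, the pseudo-expectation $\tilde{\mathbb E}$ satisfies $\{\|x\|_2^2 = 1\}$. Finally, to complete the proof, we have:

\[
\tilde{\mathbb E} A(x) = \Omega\Big( \frac{n^{3/2}}{\lambda} \Big) + \Theta(1) \tilde{\mathbb E}^\mu A(x). \qquad \Box
\]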
7 Higher-Order Tensors

We have heretofore restricted ourselves to the case $k = 3$ in our algorithms for the sake of readability. In this section we state versions of our main results for general $k$ and indicate how the proofs from the 3-tensor case may be generalized to handle arbitrary $k$. Our policy is to continue to treat $k$ as constant with respect to $n$, hiding multiplicative losses in $k$ in our asymptotic notation.

The case of general odd $k$ may be reduced to $k = 3$ by a standard trick, which we describe here for completeness. Given $A$ an order-$k$ tensor, consider the polynomial $A(x)$ and make the variable substitution $y_\beta = x^\beta$ for each multi-index $\beta$ with $|\beta| = (k+1)/2$. This yields a degree-3 polynomial $A'(x, y)$ to which the analysis in Section 3 and Section 4 applies almost unchanged, now using pseudo-distributions $\{x, y\}$ satisfying $\{\|x\|^2 = 1, \|y\|^2 = 1\}$. In the analysis of tensor PCA, this change of variables should be conducted after the input is split into signal and noise parts, in order to preserve the analysis of the second half of the rounding argument (to get from $\tilde{\mathbb E} \langle v_0, x \rangle^k$ to $\tilde{\mathbb E} \langle v_0, x \rangle$), which then requires only syntactic modifications to Lemma A.5. The only other non-syntactic difference is the need to generalize the $\lambda$-boundedness results for random polynomials to handle tensors whose dimensions are not all equal; this is already done in Theorem B.5.

For even $k$, the degree-$k$ SoS approach does not improve on the tensor unfolding algorithms of Montanari and Richard [MR14]. Indeed, by performing a similar variable substitution, $y_\beta = x^\beta$ for all $|\beta| = k/2$, the SoS algorithm reduces exactly to the eigenvalue/eigenvector computation from tensor unfolding. If we perform instead the substitution $y_\beta = x^\beta$ for $|\beta| = k/2 - 1$, it becomes possible to extract $v_0$ directly from the degree-2 pseudo-moments of an (approximately) optimal degree-4 pseudo-distribution, rather than performing an extra step to recover $v_0$ from $v$ well-correlated with $v_0^{\otimes k/2}$. Either approach recovers $v_0$ only up to sign, since the input is unchanged under the transformation $v_0 \mapsto -v_0$.

We now state analogues of all our results for general $k$. Except for the above noted differences from the $k = 3$ case, the proofs are all easy transformations of the proofs of their degree-3 counterparts.

Theorem 7.1. Let $k$ be an odd integer, $v_0 \in \mathbb{R}^n$ a unit vector, $\tau \ge n^{k/4} \log(n)^{1/4}/\varepsilon$, and $A$ an order-$k$ tensor with independent unit Gaussian entries.

1. There is an algorithm, based on semidefinite programming, which on input $T(x) = \tau \cdot \langle v_0, x \rangle^k + A(x)$ returns a unit vector $v$ with $\langle v_0, v \rangle \ge 1 - \varepsilon$ with high probability over random choice of $A$.

2. There is an algorithm, based on semidefinite programming, which on input $T(x) = \tau \cdot \langle v_0, x \rangle^k + A(x)$ certifies that $T(x) \le \tau \cdot \langle v, x \rangle^k + O(n^{k/4} \log(n)^{1/4})$ for some unit $v$ with high probability over random choice of $A$. This guarantees in particular that $v$ is close to a maximum likelihood estimator for the problem of recovering the signal $v_0$ from the input $\tau \cdot v_0^{\otimes k} + A$.

3. By solving the semidefinite relaxation approximately, both algorithms can be implemented in time $\tilde{O}(m^{1+1/k})$, where $m = n^k$ is the input size.

For even $k$, the above all hold, except now we recover $v$ with $\langle v_0, v \rangle^2 \ge 1 - \varepsilon$, and the algorithms can be implemented in nearly-linear time.

The next theorem partially resolves a conjecture of Montanari and Richard regarding 
tensor unfolding algorithms for odd k. We are able to prove their conjectured signal-to-noise 
ratio t, but under an asymmetric noise model. They conjecture that the following holds 
when A is symmetric with unit Gaussian entries. 

Theorem 7.2. Let $k$ be an odd integer, $v_0 \in \mathbb{R}^n$ a unit vector, $\tau \ge n^{k/4}/\varepsilon$, and $A$ an order-$k$ tensor with independent unit Gaussian entries. There is a nearly-linear-time algorithm, based on tensor unfolding, which, with high probability over random choice of $A$, recovers a vector $v$ with $\langle v, v_0 \rangle^2 \ge 1 - \varepsilon$.


8 Conclusion 

Open Problems 

One theme in this work has been efficiently certifying upper bounds on homogeneous 
polynomials with random coefficients. It is an interesting question to see whether one can 
(perhaps with the degree d > 4 SoS meta-algorithm) give an algorithm certifying a bound 
of $n^{3/4 - \delta}$ over the unit sphere on a degree 3 polynomial with standard Gaussian coefficients.
Such an algorithm would likely yield improved signal-to-noise guarantees for tensor PCA, 
and would be of interest in its own right. 

Conversely, another problem is to extend our lower bound to handle degree d > 4 SoS. 
Together, these two problems suggest (as was independently suggested to us by Boaz 
Barak) the problem of characterizing the SoS degree required to certify a bound of $n^{3/4 - \delta}$ as
above. 

Another problem is to simplify the linear time algorithm we give for tensor PCA under 
symmetric noise. Montanari and Richard's conjecture can be interpreted to say that the 
random rotations and decomposition into submatrices involved in our algorithm are 
unnecessary, and that in fact our linear time algorithm for recovery under asymmetric 
noise actually succeeds in the symmetric case. 
A Pseudo-Distribution Facts

Lemma A.1 (Quadric Sampling). Let $\{x\}$ be a pseudo-distribution over $\mathbb{R}^n$ of degree $d \ge 2$. Then there is an actual distribution $\{y\}$ over $\mathbb{R}^n$ so that for any polynomial $p$ of degree at most 2, $\mathbb{E}[p(y)] = \tilde{\mathbb E}[p(x)]$. Furthermore, $\{y\}$ can be sampled from in time $\mathrm{poly}\, n$.
Lemma A.2 (Pseudo-Cauchy-Schwarz, Function Version, [BBH+12]). Let $x, y$ be vector-valued polynomials. Then

\[
\langle x, y \rangle \preceq \tfrac{1}{2} (\|x\|^2 + \|y\|^2).
\]

See [BKS14b] for the cleanest proof.

Lemma A.3 (Pseudo-Cauchy-Schwarz, Powered Function Version). Let $x, y$ be vector-valued polynomials and $d > 0$ an integer. Then

\[
\langle x, y \rangle^d \preceq \tfrac{1}{2} (\|x\|^{2d} + \|y\|^{2d}).
\]

Proof. Note that $\langle x, y \rangle^d = \langle x^{\otimes d}, y^{\otimes d} \rangle$ and apply Lemma A.2. □

Yet another version of pseudo-Cauchy-Schwarz will be useful:

Lemma A.4 (Pseudo-Cauchy-Schwarz, Multiplicative Function Version, [BBH+12]). Let $\{x, y\}$ be a degree $d$ pseudo-distribution over a pair of vectors, $d \ge 2$. Then

\[
\tilde{\mathbb E}[\langle x, y \rangle] \le \sqrt{\tilde{\mathbb E}[\|x\|^2]} \sqrt{\tilde{\mathbb E}[\|y\|^2]}.
\]

Again, see [BKS14b] for the cleanest proof.

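We will need the following inequality relating $\tilde{\mathbb E} \langle x, v_0 \rangle^3$ and $\tilde{\mathbb E} \langle x, v_0 \rangle$ when $\tilde{\mathbb E} \langle x, v_0 \rangle^3$ is large.

Lemma A.5. Let $\{x\}$ be a degree-4 pseudo-distribution satisfying $\{\|x\|^2 = 1\}$, and let $v_0 \in \mathbb{R}^n$ be a unit vector. Suppose that $\tilde{\mathbb E} \langle x, v_0 \rangle^3 \ge 1 - \varepsilon$ for some $\varepsilon \ge 0$. Then $\tilde{\mathbb E} \langle x, v_0 \rangle \ge 1 - 2\varepsilon$.

Proof. Let $p(u)$ be the univariate polynomial $p(u) = 1 - 2u^3 + u$. It is easy to check that $p(u) \ge 0$ for $u \in [-1, 1]$. It follows from classical results about univariate polynomials that $p(u)$ then can be written as

\[
p(u) = s_0(u) + s_1(u)(1 + u) + s_2(u)(1 - u)
\]

for some SoS polynomials $s_0, s_1, s_2$ of degrees at most 2. (See [OZ13], fact 3.2 for a precise statement and attributions.)

Now we consider

\[
\tilde{\mathbb E}\, p(\langle x, v_0 \rangle) \ge \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)(1 + \langle x, v_0 \rangle)] + \tilde{\mathbb E}[s_2(\langle x, v_0 \rangle)(1 - \langle x, v_0 \rangle)].
\]

We have by Lemma A.2 that $\langle x, v_0 \rangle \preceq \tfrac{1}{2}(\|x\|^2 + 1)$ and also that $\langle x, v_0 \rangle \succeq -\tfrac{1}{2}(\|x\|^2 + 1)$. Multiplying the latter SoS relation by the SoS polynomial $s_1(\langle x, v_0 \rangle)$ and the former by $s_2(\langle x, v_0 \rangle)$, we get that

\begin{align*}
\tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)(1 + \langle x, v_0 \rangle)] &= \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)] + \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle) \langle x, v_0 \rangle] \\
&\ge \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)] - \tfrac{1}{2} \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)(\|x\|^2 + 1)] \\
&\ge \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)] - \tilde{\mathbb E}[s_1(\langle x, v_0 \rangle)] \\
&\ge 0,
\end{align*}

where in the second-to-last step we have used the assumption that $\{x\}$ satisfies $\{\|x\|^2 = 1\}$. A similar analysis yields

\[
\tilde{\mathbb E}[s_2(\langle x, v_0 \rangle)(1 - \langle x, v_0 \rangle)] \ge 0.
\]

All together, this means that $\tilde{\mathbb E}\, p(\langle x, v_0 \rangle) \ge 0$. Expanding, we get $\tilde{\mathbb E}[1 - 2\langle x, v_0 \rangle^3 + \langle x, v_0 \rangle] \ge 0$. Rearranging yields

\[
\tilde{\mathbb E} \langle x, v_0 \rangle \ge 2\, \tilde{\mathbb E} \langle x, v_0 \rangle^3 - 1 \ge 2(1 - \varepsilon) - 1 \ge 1 - 2\varepsilon. \qquad \Box
\]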
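
For the specific polynomial $p(u) = 1 - 2u^3 + u$ used in the proof of Lemma A.5, one explicit certificate of the required form can be checked by hand: $p(u) = \big((1+u)^2 + u^2\big)(1 - u)$, that is, $s_0 = s_1 = 0$ and $s_2(u) = (1+u)^2 + u^2$. This particular decomposition is offered here as an illustration and is not taken from the cited reference. The sketch below verifies the identity and the nonnegativity of $p$ on a grid over $[-1, 1]$.

```python
def p(u):
    """The univariate polynomial from the proof of Lemma A.5."""
    return 1 - 2 * u ** 3 + u

def s2(u):
    """A sum of two squares, hence an SoS polynomial of degree 2."""
    return (1 + u) ** 2 + u ** 2

# Evaluate on an evenly spaced grid covering [-1, 1], endpoints included.
grid = [-1 + 2 * i / 2000 for i in range(2001)]
min_p = min(p(u) for u in grid)
max_identity_error = max(abs(p(u) - s2(u) * (1 - u)) for u in grid)
```

Since $s_2 \ge 0$ everywhere and $1 - u \ge 0$ on $[-1, 1]$, the identity gives $p \ge 0$ on the interval directly, with equality only at $u = 1$.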


We will need a bound on the pseudo-expectation of a degree-3 polynomial in terms of 
the operator norm of its coefficient matrix. 

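Lemma A.6. Let $\{x\}$ be a degree-4 pseudo-distribution. Let $M \in \mathbb{R}^{n^2 \times n}$. Then $\tilde{\mathbb E} \langle x^{\otimes 2}, Mx \rangle \le \|M\| (\tilde{\mathbb E} \|x\|^4)^{3/4}$.

Proof. We begin by expanding in the monomial basis and using pseudo-Cauchy-Schwarz:

\begin{align*}
\tilde{\mathbb E} \langle x^{\otimes 2}, Mx \rangle &= \tilde{\mathbb E} \sum_{ijk} M_{(j,k),i}\, x_i x_j x_k \\
&= \tilde{\mathbb E} \sum_i x_i \sum_{jk} M_{(j,k),i}\, x_j x_k \\
&\le (\tilde{\mathbb E} \|x\|^2)^{1/2} \Big( \tilde{\mathbb E} \sum_i \Big( \sum_{jk} M_{(j,k),i}\, x_j x_k \Big)^2 \Big)^{1/2} \\
&\le (\tilde{\mathbb E} \|x\|^4)^{1/4} \Big( \tilde{\mathbb E} \sum_i \Big( \sum_{jk} M_{(j,k),i}\, x_j x_k \Big)^2 \Big)^{1/2}.
\end{align*}

We observe that $MM^T$ is a matrix representation of $\sum_i \big( \sum_{jk} M_{(j,k),i}\, x_j x_k \big)^2$. We know $MM^T \preceq \|M\|^2\, \mathrm{Id}$, so

\[
\tilde{\mathbb E} \sum_i \Big( \sum_{jk} M_{(j,k),i}\, x_j x_k \Big)^2 \le \|M\|^2\, \tilde{\mathbb E} \|x\|^4.
\]

Putting it together, we get $\tilde{\mathbb E} \langle x^{\otimes 2}, Mx \rangle \le \|M\| (\tilde{\mathbb E} \|x\|^4)^{3/4}$ as desired. □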
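
When the pseudo-distribution is a point mass at a fixed vector $x$, Lemma A.6 reduces to the deterministic inequality $\langle x^{\otimes 2}, Mx \rangle \le \|M\| \cdot \|x\|^3$. The sketch below checks this special case numerically for a random $M$. It assumes rows of $M$ are indexed by pairs $(j, k)$ in row-major order and approximates the operator norm by power iteration; the helper names and the dimension are choices made for this example only.

```python
import math
import random

def operator_norm(m, iters=500):
    """Approximate the largest singular value of m (a list of rows) by power iteration on m^T m."""
    cols = len(m[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        mv = [sum(a * b for a, b in zip(row, v)) for row in m]
        w = [sum(m[i][j] * mv[i] for i in range(len(m))) for j in range(cols)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in m]
    return math.sqrt(sum(t * t for t in mv))

def cubic_form(m, x):
    """Evaluate <x (x) x, M x>, with row j * n + k of m paired with the entry x_j * x_k."""
    n = len(x)
    mx = [sum(a * b for a, b in zip(row, x)) for row in m]
    return sum(x[j] * x[k] * mx[j * n + k] for j in range(n) for k in range(n))

rng = random.Random(1)
n = 4
M = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n * n)]
norm_M = operator_norm(M)
worst_ratio = 0.0
for _ in range(200):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_norm = math.sqrt(sum(t * t for t in x))
    worst_ratio = max(worst_ratio, cubic_form(M, x) / (norm_M * x_norm ** 3))
```

The ratio `worst_ratio` should never exceed 1, matching the bound with $\tilde{\mathbb E} \|x\|^4 = \|x\|^4$.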


B Concentration bounds 

B.l Elementary Random Matrix Review 

We will be extensively concerned with various real random matrices. A great deal is known 
about natural classes of such matrices; see the excellent book of Tao [Tao12] and the notes by Vershynin and Tropp [Ver11, Tro12].
Our presentation here follows Vershynin's [Ver11]. Let $X$ be a real random variable. The subgaussian norm $\|X\|_{\psi_2}$ of $X$ is $\sup_{p \ge 1} p^{-1/2} (\mathbb{E}|X|^p)^{1/p}$. Let $\{a\}$ be a distribution on $\mathbb{R}^n$. The subgaussian norm $\|a\|_{\psi_2}$ of $\{a\}$ is the maximal subgaussian norm of the one-dimensional marginals: $\|a\|_{\psi_2} = \sup_{\|u\| = 1} \|\langle a, u \rangle\|_{\psi_2}$. A family of random variables $\{X_n\}_{n \in \mathbb{N}}$ is subgaussian if $\|X_n\|_{\psi_2} = O(1)$. The reader may easily check that an $n$-dimensional vector of independent standard Gaussians or independent $\pm 1$ variables is subgaussian.

It will be convenient to use the following standard result on the concentration of empirical covariance matrices. This statement is borrowed from [Ver11], Corollary 5.50.

Lemma B.1. Consider a sub-gaussian distribution $\{a\}$ in $\mathbb{R}^m$ with covariance matrix $\Sigma$, and let $\delta \in (0, 1)$, $t \ge 1$. If $a_1, \ldots, a_N \sim \{a\}$ with $N \ge C(t/\delta)^2 m$ then $\|\frac{1}{N} \sum_i a_i a_i^T - \Sigma\| \le \delta$ with probability at least $1 - 2\exp(-t^2 m)$. Here $C = C(K)$ depends only on the sub-gaussian norm $K = \|a\|_{\psi_2}$ of a random vector taken from this distribution.

We will also need the matrix Bernstein inequality. This statement is borrowed from Theorem 1.6.2 of Tropp [Tro12].

Theorem B.2 (Matrix Bernstein). Let $S_1, \ldots, S_m$ be independent square random matrices with dimension $n$. Assume that each matrix has bounded deviation from its mean: $\|S_i - \mathbb{E} S_i\| \le R$ for all $i$. Form the sum $Z = \sum_i S_i$ and introduce a variance parameter

\[
\sigma^2 = \max\{ \|\mathbb{E}(Z - \mathbb{E} Z)(Z - \mathbb{E} Z)^T\|,\; \|\mathbb{E}(Z - \mathbb{E} Z)^T (Z - \mathbb{E} Z)\| \}.
\]

Then

\[
\mathbb{P}\{ \|Z - \mathbb{E} Z\| \ge t \} \le 2n \exp\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big) \quad \text{for all } t \ge 0.
\]

We will need bounds on the operator norm of random square and rectangular matrices, both of which are special cases of Theorem 5.39 in [Ver11].

Lemma B.3. Let $A$ be an $n \times n$ matrix with independent entries from $\mathcal{N}(0, 1)$. Then with probability $1 - n^{-\omega(1)}$, the operator norm $\|A\|$ satisfies $\|A\| \le O(\sqrt{n})$.

Lemma B.4. Let $A$ be an $n^2 \times n$ matrix with independent entries from $\mathcal{N}(0, 1)$. Then with probability $1 - n^{-\omega(1)}$, the operator norm $\|A\|$ satisfies $\|A\| \le O(n)$.
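
To get a feel for the scales in Lemma B.3 and Lemma B.4, the following sketch draws one square and one tall Gaussian matrix and compares their operator norms with $\sqrt{n}$ and $n$ respectively. It is a small numerical illustration only: the dimension `n = 16`, the seed, and the power-iteration routine are choices made for this example. The standard heuristic for an $N \times n$ Gaussian matrix is a largest singular value near $\sqrt{N} + \sqrt{n}$, so the two ratios should land near 2 and near $1 + 1/\sqrt{n}$.

```python
import math
import random

def gaussian_matrix(rows, cols, rng):
    """A rows x cols matrix (list of rows) with independent N(0, 1) entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def operator_norm(m, iters=100):
    """Approximate the largest singular value of m by power iteration on m^T m."""
    cols = len(m[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        mv = [sum(a * b for a, b in zip(row, v)) for row in m]
        w = [sum(m[i][j] * mv[i] for i in range(len(m))) for j in range(cols)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    mv = [sum(a * b for a, b in zip(row, v)) for row in m]
    return math.sqrt(sum(t * t for t in mv))

rng = random.Random(0)
n = 16
square_ratio = operator_norm(gaussian_matrix(n, n, rng)) / math.sqrt(n)  # Lemma B.3 scale
tall_ratio = operator_norm(gaussian_matrix(n * n, n, rng)) / n           # Lemma B.4 scale
```

Both ratios stay bounded by small constants, consistent with the $O(\sqrt{n})$ and $O(n)$ bounds.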


B.2 Concentration for $\sum_i A_i \otimes A_i$ and Related Ensembles

Our first concentration theorem provides control over the nontrivial permutations of the matrix $AA^T$ under the action of $S_4$ for a tensor $A$ with independent entries.

Theorem B.5. Let $c \in \{1, 2\}$ and $d \ge 1$ an integer. Let $A_1, \ldots, A_{n^c}$ be iid random matrices in $\{\pm 1\}^{n^d \times n^d}$ or with independent entries from $\mathcal{N}(0, 1)$. Then, with probability $1 - O(n^{-100})$,

\[
\Big\| \sum_{i \in [n^c]} A_i \otimes A_i - \mathbb{E} A_i \otimes A_i \Big\| \lesssim \sqrt{d}\, n^{(2d+c)/2} \cdot (\log n)^{1/2}
\]

and

\[
\Big\| \sum_{i \in [n^c]} A_i \otimes A_i^T - \mathbb{E} A_i \otimes A_i^T \Big\| \lesssim \sqrt{d}\, n^{(2d+c)/2} \cdot (\log n)^{1/2}.
\]
We can prove Theorem 3.3 as a corollary of the above.

Proof of Theorem 3.3. Let $A$ have iid Gaussian entries. We claim that $\mathbb{E} A \otimes A$ is a matrix representation of $\|x\|^4$. To see this, we compute

\[
\langle x^{\otimes 2}, \mathbb{E}(A \otimes A) x^{\otimes 2} \rangle = \mathbb{E} \langle x, Ax \rangle^2 = \sum_{i,j,k,l} \mathbb{E} A_{ij} A_{kl}\, x_i x_j x_k x_l = \sum_{i,j} x_i^2 x_j^2 = \|x\|^4.
\]

Now by Theorem B.5, we know that for $A_i$ the slices of the tensor $A$ from the statement of Theorem 3.3,

\[
\sum_i A_i \otimes A_i \preceq n \cdot \mathbb{E} A \otimes A + \lambda^2 \cdot \mathrm{Id}
\]

for $\lambda = O(n^{3/4} \log(n)^{1/4})$. Since $n = O(\lambda^2)$ and both $\mathrm{Id}$ and $\mathbb{E} A \otimes A$ are matrix representations of $\|x\|^4$, we are done. □
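
The claim that $\mathbb{E} A \otimes A$ represents $\|x\|^4$ can be made concrete. Since $\mathbb{E} A_{ij} A_{kl}$ equals 1 exactly when $(i,j) = (k,l)$, the expectation is the rank-one matrix $ww^T$ with $w$ the flattening of the identity, under the indexing convention $(A \otimes A)_{(i,k),(j,l)} = A_{ij} A_{kl}$ assumed here. The sketch below checks, for a random $x$, that both $ww^T$ and the $n^2 \times n^2$ identity evaluate to $\|x\|^4$ on $x^{\otimes 2}$. The helper names are illustrative.

```python
import random

def flatten_identity(n):
    """Vectorize the n x n identity matrix in row-major order."""
    return [1.0 if i == j else 0.0 for i in range(n) for j in range(n)]

def kron_vec(x):
    """Return x (x) x as a flat list of length n^2."""
    return [a * b for a in x for b in x]

def rank_one_quadratic_form(w, y):
    """Evaluate y^T (w w^T) y = <w, y>^2 without forming the n^2 x n^2 matrix."""
    return sum(a * b for a, b in zip(w, y)) ** 2

rng = random.Random(0)
n = 5
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = kron_vec(x)
norm4 = sum(t * t for t in x) ** 2
via_expectation = rank_one_quadratic_form(flatten_identity(n), y)  # uses E[A (x) A] = w w^T
via_identity = sum(t * t for t in y)                               # uses Id on R^{n^2}
```

Two different matrices giving the same polynomial is exactly what "matrix representation" allows, and it is what lets the proof trade $n \cdot \mathbb{E} A \otimes A$ for a multiple of $\mathrm{Id}$.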

Now we prove Theorem B.5. We will prove only the statement about $\sum_i A_i \otimes A_i$, as the case of $\sum_i A_i \otimes A_i^T$ is similar.

Let $A_1, \ldots, A_{n^c}$ be as in Theorem B.5. We first need to get a handle on their norms individually, for which we need the following lemma.

Lemma B.6. Let $A$ be a random matrix in $\{\pm 1\}^{n^d \times n^d}$ or with independent entries from $\mathcal{N}(0, 1)$. For all $t \ge 1$, the probability of the event $\{\|A\| > t n^{d/2}\}$ is at most $2^{-t^2 n^d / K}$ for some absolute constant $K$.

Proof. The subgaussian norm of the rows of $A$ is constant and they are identically and isotropically distributed. Hence Theorem 5.39 of [Ver11] applies to give the result. □

Since the norms of the matrices $A_1, \ldots, A_{n^c}$ are concentrated around $n^{d/2}$ (by Lemma B.6), it will be enough to prove Theorem B.5 after truncating the matrices $A_1, \ldots, A_{n^c}$. For $t \ge 1$, define iid random matrices $A_1', \ldots, A_{n^c}'$ such that

\[
A_i' \overset{\mathrm{def}}{=}
\begin{cases}
A_i & \text{if } \|A_i\| \le t n^{d/2}, \\
0 & \text{otherwise}
\end{cases}
\]

for some $t$ to be chosen later. Lemma B.6 allows us to show that the random matrices $A_i \otimes A_i$ and $A_i' \otimes A_i'$ have almost the same expectation. For the remainder of this section, let $K$ be the absolute constant from Lemma B.6.

Lemma B.7. For every $i \in [n^c]$ and all $t \ge 1$, the expectations of $A_i \otimes A_i$ and $A_i' \otimes A_i'$ satisfy

\[
\| \mathbb{E}[A_i \otimes A_i] - \mathbb{E}[A_i' \otimes A_i'] \| \le O(1) \cdot 2^{-t n^d / K}.
\]
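Proof. Using Jensen's inequality and that $A_i = A_i'$ unless $\|A_i\| > t n^{d/2}$, we have

\begin{align*}
\| \mathbb{E}\, A_i \otimes A_i - A_i' \otimes A_i' \| &\le \mathbb{E} \| A_i \otimes A_i - A_i' \otimes A_i' \| && \text{Jensen's inequality} \\
&= \int_{t n^{d/2}}^{\infty} \mathbb{P}(\|A_i\| \ge \sqrt{s})\, ds && \text{since } A_i = A_i' \text{ unless } \|A_i\| > t n^{d/2} \\
&\le \int_{t n^{d/2}}^{\infty} 2^{-s/K}\, ds && \text{by Lemma B.6} \\
&\le \sum_{i=0}^{\infty} 2^{-t n^{d/2}/K} \cdot 2^{-i/K} && \text{discretizing the integral} \\
&= O(2^{-t n^{d/2}/K}) && \text{as desired.}
\end{align*}

□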


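Lemma B.8. Let $B_1', \ldots, B_{n^c}'$ be i.i.d. matrices such that $B_i' = A_i' \otimes A_i' - \mathbb{E}[A_i' \otimes A_i']$. Then for every $C \ge 1$ with $C \le 3t^2 n^{c/2}$,

\[
\mathbb{P}\Big\{ \Big\| \sum_{i \in [n^c]} B_i' \Big\| \ge C \cdot n^{(2d+c)/2} \Big\} \le 2n^{2d} \cdot \exp\Big( \frac{-C^2}{12 t^4} \Big).
\]

Proof. For $R = 2t^2 n^d$, the random matrices $B_1', \ldots, B_{n^c}'$ satisfy $\{\|B_i'\| \le R\}$ with probability 1. Therefore, by the Bernstein bound for non-symmetric matrices [Tro12, Theorem 1.6],

\[
\mathbb{P}\Big\{ \Big\| \sum_{i=1}^{n^c} B_i' \Big\| \ge s \Big\} \le 2n^{2d} \cdot \exp\Big( \frac{-s^2/2}{\sigma^2 + Rs/3} \Big),
\]

where $\sigma^2 = \max\{ \|\sum_i \mathbb{E} B_i' (B_i')^T\|, \|\sum_i \mathbb{E} (B_i')^T B_i'\| \} \le n^c \cdot R^2$. For $s = C \cdot n^{(2d+c)/2}$, the probability is bounded by

\[
\mathbb{P}\Big\{ \Big\| \sum_{i=1}^{n^c} B_i' \Big\| \ge s \Big\} \le 2n^{2d} \cdot \exp\Big( \frac{-C^2 \cdot n^{2d+c}/2}{4t^4 \cdot n^{2d+c} + 2t^2 C \cdot n^{(4d+c)/2}/3} \Big).
\]

Since our parameters satisfy $t^2 C \cdot n^{(4d+c)/2}/3 \le t^4 n^{2d+c}$, the denominator is at most $6t^4 n^{2d+c}$ and this probability is bounded by

\[
2n^{2d} \cdot \exp\Big( \frac{-C^2}{12 t^4} \Big).
\]

□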
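
The simplification at the end of the proof of Lemma B.8 is pure arithmetic: with $R = 2t^2 n^d$, $\sigma^2 \le n^c R^2$ and $s = C n^{(2d+c)/2}$, the condition $C \le 3t^2 n^{c/2}$ makes the $Rs/3$ term at most half of $\sigma^2$, so the Bernstein exponent is at most $-C^2/(12 t^4)$. The constant 12 is what this direct calculation gives; any absolute constant would serve the later argument. The sketch below checks the inequality for a few parameter choices, which are arbitrary values picked for this example.

```python
def bernstein_exponent(s, sigma2, R):
    """Exponent -(s^2 / 2) / (sigma^2 + R s / 3) from the matrix Bernstein tail bound."""
    return -(s * s / 2) / (sigma2 + R * s / 3)

def lemma_b8_exponent(n, c, d, t, C):
    """Plug in R = 2 t^2 n^d, sigma^2 = n^c R^2 (the upper bound) and s = C n^{(2d+c)/2}."""
    R = 2 * t * t * n ** d
    sigma2 = n ** c * R * R
    s = C * n ** ((2 * d + c) / 2)
    return bernstein_exponent(s, sigma2, R)

n, c, d, t = 50, 1, 1, 1.0
checks = []
# The last value of C sits exactly at the boundary C = 3 t^2 n^{c/2}.
for C in (1.0, 5.0, 3 * t * t * n ** (c / 2)):
    checks.append((lemma_b8_exponent(n, c, d, t, C), -C * C / (12 * t ** 4)))
```

Each pair in `checks` holds the exact exponent followed by the simplified bound; the first entry is never larger than the second, with equality at the boundary value of $C$.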


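At this point, we have all components of the proof of Theorem B.5.

Proof of Theorem B.5 for $\sum_i A_i \otimes A_i$ (other case is similar). By Lemma B.8,

\[
\mathbb{P}\Big\{ \Big\| \sum_i A_i' \otimes A_i' - \sum_i \mathbb{E}[A_i' \otimes A_i'] \Big\| \ge C \cdot n^{(2d+c)/2} \Big\} \le 2n^{2d} \cdot \exp\Big( \frac{-C^2}{K t^4} \Big).
\]

At the same time, by Lemma B.6 and a union bound,

\[
\mathbb{P}\{ A_1 = A_1', \ldots, A_{n^c} = A_{n^c}' \} \ge 1 - n^c \cdot 2^{-t^2 n^d / K}.
\]

By Lemma B.7 and triangle inequality,

\[
\Big\| \sum_i \mathbb{E}[A_i \otimes A_i] - \sum_i \mathbb{E}[A_i' \otimes A_i'] \Big\| \le n^c \cdot 2^{-t n^d / K}.
\]

Together, these bounds imply

\[
\mathbb{P}\Big\{ \Big\| \sum_i A_i \otimes A_i - \sum_i \mathbb{E}[A_i \otimes A_i] \Big\| \ge C \cdot n^{(2d+c)/2} + n^c \cdot 2^{-t n^d / K} \Big\} \le 2n^{2d} \cdot \exp\Big( \frac{-C^2}{K t^4} \Big) + n^c \cdot 2^{-t^2 n^d / K}.
\]

We choose $t = 1$ and $C = 100 \sqrt{2Kd \log n}$ and assume that $n$ is large enough so that $C \cdot n^{(2d+c)/2} \ge n^c \cdot 2^{-t n^d / K}$ and $2n^{2d} \cdot \exp\big( \frac{-C^2}{K t^4} \big) \ge n^c \cdot 2^{-t^2 n^d / K}$. Then the probability satisfies

\[
\mathbb{P}\Big\{ \Big\| \sum_i A_i \otimes A_i - \sum_i \mathbb{E}[A_i \otimes A_i] \Big\| \ge 10\, n^{(2d+c)/2} \sqrt{2Kd \log n} \Big\} \le 4n^{-100}.
\]

□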


B.3 Concentration for Spectral SoS Analyses 

Lemma B.9 (Restatement of Lemma 5.5). Let $T = \tau \cdot v_0^{\otimes 3} + A$. Suppose $A$ has independent entries from $\mathcal{N}(0, 1)$. Then with probability $1 - O(n^{-100})$ we have $\|\sum_i A_i \otimes A_i - \mathbb{E} \sum_i A_i \otimes A_i\| \le O(n^{3/2} \log(n)^{1/2})$ and $\|\sum_i v_0(i) A_i\| \le O(\sqrt{n})$.

Proof. The first claim is immediate from Theorem B.5. For the second, we note that since $v_0$ is a unit vector, the matrix $\sum_i v_0(i) A_i$ has independent entries from $\mathcal{N}(0, 1)$. Thus, by Lemma B.3, $\|\sum_i v_0(i) A_i\| \le O(\sqrt{n})$ with probability $1 - O(n^{-100})$, as desired. □

Lemma B.10 (Restatement of Lemma 5.10 for General Odd k). Let $A$ be a $k$-tensor with $k$ an odd integer, with independent entries from $\mathcal{N}(0, 1)$. Let $v_0 \in \mathbb{R}^n$ be a unit vector, and let $V$ be the $n^{(k+1)/2} \times n^{(k-1)/2}$ unfolding of $v_0^{\otimes k}$. Let $A$ be the $n^{(k+1)/2} \times n^{(k-1)/2}$ unfolding of $A$. Then with probability $1 - O(n^{-100})$, the matrix $A$ satisfies $A^T A = n^{(k+1)/2} I + E$ for some $E$ with $\|E\| \le O(n^{k/2} \log(n))$ and $\|A^T V\| \le O(n^{(k-1)/4} \log(n)^{1/2})$.

Proof. With $\delta = O(1/\sqrt{n})$ and $t = 1$, our parameters will satisfy $n^{(k+1)/2} \ge (t/\delta)^2 n^{(k-1)/2}$. Hence, by Lemma B.1,

\[
\|E\| = \| A^T A - n^{(k+1)/2} I \| = \Big\| \sum_{|\alpha| = (k+1)/2} a_\alpha a_\alpha^T - n^{(k+1)/2} \cdot \mathrm{Id} \Big\| \le n^{(k+1)/2} \cdot O\Big( \frac{1}{\sqrt{n}} \Big) = O(n^{k/2})
\]

with probability at least $1 - 2\exp(-n^{(k-1)/2}) \ge 1 - O(n^{-100})$.

It remains to bound $\|A^T V\|$. Note that $V = uw^T$ for fixed unit vectors $u \in \mathbb{R}^{n^{(k+1)/2}}$ and $w \in \mathbb{R}^{n^{(k-1)/2}}$. So $\|A^T V\| \le \|A^T u\|$. But $A^T u$ is distributed according to $\mathcal{N}(0, 1)^{n^{(k-1)/2}}$ and so $\|A^T u\| \le O(\sqrt{n^{(k-1)/2} \log n})$ with probability $1 - n^{-100}$ by standard arguments. □
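
For $k = 3$ the first claim of Lemma B.10 says that an $n^2 \times n$ Gaussian matrix $A$ has $A^T A$ within $O(n^{3/2})$ of $n^2 \cdot \mathrm{Id}$ in operator norm. The sketch below illustrates this numerically for a single draw. The dimension `n = 12`, the seed, and the power-iteration estimate of $\|E\|$ are choices made for this example, and the constant observed for one small instance is not a statement about the hidden constant in the lemma.

```python
import math
import random

def gram_deviation_norm(n, seed=0, iters=200):
    """Estimate || A^T A - n^2 I || for an n^2 x n matrix A with N(0, 1) entries."""
    rng = random.Random(seed)
    rows = n * n
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(rows)]
    # E = A^T A - n^2 * I is an n x n symmetric matrix.
    E = [[sum(A[r][i] * A[r][j] for r in range(rows)) - (rows if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Power iteration on the symmetric matrix E approaches its largest |eigenvalue|.
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(E[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    w = [sum(E[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(t * t for t in w))

n = 12
ratio = gram_deviation_norm(n) / n ** 1.5
```

The value `ratio` stays of constant order, whereas the trivial bound $\|A\|^2 \approx n^2$ would only give a ratio of order $\sqrt{n}$.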


B.4 Concentration for Lower Bounds 

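The next lemma collects the concentration results necessary to apply our lower bounds Theorem 6.3 and Theorem 6.4 to random polynomials.
Lemma B.11. Let $A$ be a random 3-tensor with unit Gaussian entries. For a real parameter $\lambda$, let $\mathcal L : \mathbb{R}[x]_4 \to \mathbb{R}$ be the linear operator whose matrix representation is given by $M_{\mathcal L} := \frac{1}{n^2 \lambda^2} \sum_{\pi \in S_4} \pi \cdot AA^T$. There is $\lambda = O(n^{3/2}/\log(n)^{1/2})$ so that with probability $1 - O(n^{-50})$ the following events all occur for every $\pi \in S_3$.

\begin{align}
-2\lambda^2 \cdot \Pi\, \mathrm{Id}\, \Pi &\preceq \tfrac{1}{2} \Pi \big[ \sigma \cdot A^\pi (A^\pi)^T + \sigma^2 \cdot A^\pi (A^\pi)^T + (\sigma \cdot A^\pi (A^\pi)^T)^T + (\sigma^2 \cdot A^\pi (A^\pi)^T)^T \big] \Pi \tag{B.1} \\
\Big\langle A, \sum_{\pi \in S_3} A^\pi \Big\rangle &= \Omega(n^3) \tag{B.2} \\
\langle \mathrm{Id}^{\mathrm{sym}}, A^\pi (A^\pi)^T \rangle &= O(n^3) \tag{B.3} \\
n \cdot \max_i \Big| \frac{1}{n^{3/2} \lambda} \langle \mathrm{Id}_{n \times n}, A^\pi_i \rangle \Big| &= O(1) \tag{B.4} \\
n^2 \cdot \max_{i \ne j} \big| \mathcal L \|x\|_2^2 x_i x_j \big| &= O(1) \tag{B.5} \\
n^{3/2} \cdot \max_i \Big| \mathcal L \|x\|_2^2 x_i^2 - \tfrac{1}{n} \mathcal L \|x\|_2^4 \Big| &= O(1/n) \tag{B.6}
\end{align}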


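Proof. For (B.1), from Theorem B.5, Lemma 6.8, the observation that multiplication by an orthogonal operator cannot increase the operator norm, a union bound over all $\pi$, and the triangle inequality, it follows that:

\[
\| \sigma \cdot A^\pi (A^\pi)^T - \mathbb{E}[\sigma \cdot A^\pi (A^\pi)^T] + \sigma^2 \cdot A^\pi (A^\pi)^T - \mathbb{E}[\sigma^2 \cdot A^\pi (A^\pi)^T] \| \le 2\lambda^2
\]

with probability $1 - n^{-100}$. By the definition of the operator norm and another application of triangle inequality, this implies

\begin{align*}
-4\lambda^2\, \mathrm{Id} \preceq{} & \sigma \cdot A^\pi (A^\pi)^T + \sigma^2 \cdot A^\pi (A^\pi)^T + (\sigma \cdot A^\pi (A^\pi)^T)^T + (\sigma^2 \cdot A^\pi (A^\pi)^T)^T \\
& - \mathbb{E}[\sigma \cdot A^\pi (A^\pi)^T] - \mathbb{E}[\sigma^2 \cdot A^\pi (A^\pi)^T] - \mathbb{E}[(\sigma \cdot A^\pi (A^\pi)^T)^T] - \mathbb{E}[(\sigma^2 \cdot A^\pi (A^\pi)^T)^T].
\end{align*}

We note that $\mathbb{E}[\sigma \cdot A^\pi (A^\pi)^T] = \sigma \cdot \mathrm{Id}$ and $\mathbb{E}[\sigma^2 \cdot A^\pi (A^\pi)^T] = \sigma^2 \cdot \mathrm{Id}$, and the same for their transposes, and that $\Pi(\sigma \cdot \mathrm{Id} + \sigma^2 \cdot \mathrm{Id})\Pi \succeq 0$. So, dividing by 2 and projecting onto the $\Pi$ subspace:

\[
-2\lambda^2 \cdot \Pi\, \mathrm{Id}\, \Pi \preceq \tfrac{1}{2} \Pi \big( \sigma \cdot A^\pi (A^\pi)^T + \sigma^2 \cdot A^\pi (A^\pi)^T + (\sigma \cdot A^\pi (A^\pi)^T)^T + (\sigma^2 \cdot A^\pi (A^\pi)^T)^T \big) \Pi.
\]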

We turn to (B.2). By a Chernoff bound, $\langle A, A \rangle = \Omega(n^3)$ with probability $1 - n^{-100}$. Let $\pi \in S_3$ be a nontrivial permutation. To each multi-index $\alpha$ with $|\alpha| = 3$ we associate its orbit $O_\alpha$ under $\langle \pi \rangle$. If $\alpha$ has three distinct indices, then $|O_\alpha| > 1$ and $\sum_{\beta \in O_\alpha} A_\beta A^\pi_\beta$ is a random variable $X_\alpha$ with the following properties:

• $|X_\alpha| < O(\log n)$ with probability $1 - n^{-\omega(1)}$.

• $X_\alpha$ and $-X_\alpha$ are identically distributed.

Next, we observe that we can decompose

\[
\langle A, A^\pi \rangle = \sum_{|\alpha| = 3} A_\alpha A^\pi_\alpha = R + \sum_{O_\alpha} X_\alpha,
\]

where $R$ is the sum over multi-indices $\alpha$ with repeated indices, and therefore has $|R| = O(n^2)$ with probability $1 - n^{-100}$. By a standard Chernoff bound, $|\sum_{O_\alpha} X_\alpha| = O(n^2)$ with probability $1 - O(n^{-100})$. By a union bound over all $\pi$, we get that with probability $1 - O(n^{-100})$,

\[
\Big\langle A, \sum_{\pi \in S_3} A^\pi \Big\rangle = n^3 - O(n^2) = \Omega(n^3),
\]

establishing (B.2).

Next up is (B.3). Because A n are identically distributed for all n £ S 3 we assume 
without loss of generality that A n = A. The matrix Id sym has 0(n 2 ) nonzero entries. Any 
individual entry of AA T is with probability 1 - n _a ’d) a t most 0(n). So (Id sym , AA T ) = 0(n 3 ) 
with probability 1 - 0(n~ m ). 

Next, (B.4). As before, we assume without loss of generality that $\pi$ is the trivial permutation. For fixed $1 \le i \le n$, we have $\langle \mathrm{Id}_{n\times n}, A_i\rangle = \sum_j A_{ijj}$, which is a sum of $n$ independent unit Gaussians, so $|\langle \mathrm{Id}_{n\times n}, A_i\rangle| \le O(\sqrt{n}\log n)$ with probability $1 - n^{-\omega(1)}$. By a union bound over $i$ this also holds for $\max_i |\langle \mathrm{Id}_{n\times n}, A_i\rangle|$. Thus with probability $1 - O(n^{-100})$,


$$
n \max_i \left| \frac{1}{n^{3/2}\lambda} \langle \mathrm{Id}_{n\times n}, A_i\rangle \right| = \frac{\tilde{O}(1)}{\lambda}.
$$
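
The concentration step behind (B.4) can be illustrated numerically. The sketch below is only a sanity check under stated assumptions: it samples just the slice diagonals $A_{ijj}$ (the only entries that enter the trace), and the function name, seed, and $n = 200$ are hypothetical choices, with the constant in the $\sqrt{n}\log n$ bound taken to be 1.

```python
import math
import random

def max_slice_trace(n, rng):
    """Return max_i |sum_j A_ijj| for iid N(0,1) entries.

    Only the diagonal entries A_ijj of each slice matter for the trace,
    so only those n * n values are sampled.
    """
    return max(abs(sum(rng.gauss(0.0, 1.0) for _ in range(n))) for _ in range(n))

n = 200
rng = random.Random(1)                  # fixed seed so the run is reproducible
max_trace = max_slice_trace(n, rng)
scale = math.sqrt(n) * math.log(n)      # the sqrt(n) log n bound with constant 1
normalized = max_trace / scale          # expected to be below 1 for this bound
```

Each trace is a Gaussian of variance $n$, so the maximum over $n$ slices is typically of order $\sqrt{n \log n}$, comfortably inside the $\sqrt{n}\log n$ bound used in the proof.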


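Last up are (B.5) and (B.6). Since we will do a union bound later, we fix $i, j \le n$. Let $w \in \mathbb{R}^{n^2}$ be the matrix flattening of $\mathrm{Id}_{n\times n}$. We expand $\mathcal{L}\|x\|^2 x_i x_j$ as

$$
\begin{aligned}
\mathcal{L}\|x\|^2 x_i x_j = \frac{1}{n^2 O(\lambda^2)}\Big(& w^T\Pi(AA^T + \sigma\cdot AA^T + \sigma^2\cdot AA^T)\Pi(e_i\otimes e_j) \\
& + w^T\Pi(AA^T + \sigma\cdot AA^T + \sigma^2\cdot AA^T)^T\Pi(e_i\otimes e_j)\Big).
\end{aligned}
$$

We have $\Pi w = w$ and we let $e_{ij} := \Pi(e_i\otimes e_j) = \tfrac{1}{2}(e_i\otimes e_j + e_j\otimes e_i)$. So using Lemma 6.8,

$$
\begin{aligned}
n^2 O(\lambda^2)\,\mathcal{L}\|x\|^2 x_i x_j ={}& w^T(AA^T + \sigma\cdot AA^T + \sigma^2\cdot AA^T)e_{ij} \\
& + w^T(AA^T + \sigma\cdot AA^T + \sigma^2\cdot AA^T)^T e_{ij}
\end{aligned}
$$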

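$$
\begin{aligned}
{}= w^T\Big(& AA^T e_{ij} \\
& + \tfrac{1}{2}\sum_k A_k e_j\otimes A_k e_i + \tfrac{1}{2}\sum_k A_k e_i\otimes A_k e_j \\
& + \tfrac{1}{2}\sum_k A_k^T e_j\otimes A_k^T e_i + \tfrac{1}{2}\sum_k A_k^T e_i\otimes A_k^T e_j \\
& + \tfrac{1}{2}\sum_k A_k e_j\otimes A_k^T e_i + \tfrac{1}{2}\sum_k A_k e_i\otimes A_k^T e_j \\
& + \tfrac{1}{2}\sum_k A_k^T e_j\otimes A_k e_i + \tfrac{1}{2}\sum_k A_k^T e_i\otimes A_k e_j\Big).
\end{aligned}
$$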

For $i \ne j$, each term $w^T(A_k e_j\otimes A_k e_i)$ (or similar, with various transposes) is the sum of $n$ independent products of pairs of independent unit Gaussians, so by a Chernoff bound followed by a union bound, with probability $1 - n^{-\omega(1)}$ all of them are $O(\sqrt{n}\log n)$. There are $O(n)$ such terms, for an upper bound of $O(n^{3/2}\log n)$ on the contribution from the tensored parts.

At the same time, $w^T A$ is a sum $\sum_k a_{kk}$ of $n$ rows of $A$ and $A^T e_{ij}$ is the average of two rows of $A$; since $i \ne j$ these rows are independent from $w^T A$. Writing this out, $w^T AA^T e_{ij} = \tfrac{1}{2}\sum_k \langle a_{kk}, a_{ij} + a_{ji}\rangle$. Again by a standard Chernoff and union bound argument this is in absolute value at most $O(n^{3/2}\log n)$ with probability $1 - n^{-\omega(1)}$. In sum, when $i \ne j$, with probability at least $1 - n^{-\omega(1)}$, we get $|\mathcal{L}\|x\|^2 x_i x_j| = O(1/(n^2\log n))$. After a union bound, the maximum over all $i, j$ is $O(1/n^2)$. This concludes (B.5).

In the $i = j$ case, since $\sum_k w^T(A_k e_i\otimes A_k e_i) = \sum_{j,k}\langle e_j, A_k e_i\rangle^2$ is a sum of $n^2$ independent square Gaussians, by a Bernstein inequality, $|\sum_k\langle w, A_k e_i\otimes A_k e_i\rangle - n^2| \le O(n\log^{1/2} n)$ with high probability. The same holds for the other tensored terms, and for $w^T AA^T e_{ii}$, so when $i = j$ we get that $|O(\lambda^2)\,\mathcal{L}\|x\|^2 x_i^2 - 5| \le O((\log^{1/2} n)/n)$ with high probability. Summing over all $i$, we find that $|O(\lambda^2)\,\mathcal{L}\|x\|^4 - 5n| \le O(\log^{1/2} n)$, so that $O(\lambda^2)\,|\mathcal{L}\|x\|^2 x_i^2 - \tfrac{1}{n}\mathcal{L}\|x\|^4| \le O((\log^{1/2} n)/n)$ with high probability. A union bound over $i$ completes the argument. □
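
The two concentration regimes used for (B.5) and (B.6) can be seen in a small experiment. The sketch below is a hedged illustration only: it assumes the slice convention $(A_k)_{ab} = A_{abk}$, evaluates just the untransposed tensored term $\sum_k \langle A_k e_i, A_k e_j\rangle$, and uses hypothetical helper names, a fixed seed, $n = 30$, and an arbitrary constant 5 in the diagonal bound.

```python
import itertools
import math
import random

def gaussian_tensor(n, rng):
    """n x n x n nested list of iid N(0,1) entries, indexed A[a][b][k]."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
            for _ in range(n)]

def tensored_term(A, i, j):
    """sum_k <A_k e_i, A_k e_j>, with slices (A_k)[a][b] = A[a][b][k]."""
    n = len(A)
    return sum(A[a][i][k] * A[a][j][k] for a in range(n) for k in range(n))

n = 30
rng = random.Random(2)          # fixed seed so the run is reproducible
A = gaussian_tensor(n, rng)

# Off-diagonal case (i != j): a mean-zero sum of n^2 products of independent Gaussians.
off_diag = max(abs(tensored_term(A, i, j))
               for i, j in itertools.combinations(range(n), 2))
# Diagonal case (i == j): a sum of n^2 squared Gaussians, concentrating near n^2.
diag_dev = max(abs(tensored_term(A, i, i) - n**2) for i in range(n))

off_diag_bound = n**1.5 * math.log(n)          # the n^{3/2} log n bound, constant 1
diag_bound = 5 * n * math.sqrt(math.log(n))    # n log^{1/2} n; the 5 is illustrative
```

The off-diagonal terms fluctuate around zero at a scale well below $n^{3/2}\log n$, while the diagonal terms sit near $n^2$ with deviations of order $n\log^{1/2} n$, which is why the proof treats the two cases separately.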

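Lemma B.12. Let $A$ be a random 4-tensor with unit Gaussian entries. There is $\lambda^2 = O(n)$ so that when $\mathcal{L} : \mathbb{R}[x]_4 \to \mathbb{R}$ is the linear operator whose matrix representation $M_{\mathcal{L}}$ is given by $M_{\mathcal{L}} := \frac{1}{n^2\lambda^2}\sum_{\pi\in S_4} A^\pi$, with probability $1 - O(n^{-50})$ the following events all occur for every $\pi \in S_4$.

$$-\lambda^2 \preceq \tfrac{1}{2}A^\pi + \tfrac{1}{2}(A^\pi)^T \tag{B.7}$$

$$\Big\langle A, \sum_{\pi\in S_4} A^\pi\Big\rangle = \Omega(n^4) \tag{B.8}$$

$$\langle \mathrm{Id}^{\mathrm{sym}}, A^\pi\rangle = O(\lambda^2\sqrt{n}) \tag{B.9}$$

$$n^2 \max_{i\ne j} |\mathcal{L}\|x\|^2 x_i x_j| = O(1) \tag{B.10}$$

$$n^{3/2} \max_i |\mathcal{L}\|x\|^2 x_i^2| = O(1). \tag{B.11}$$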

Proof. For (B.7), we note that $\tfrac{1}{2}A^\pi + \tfrac{1}{2}(A^\pi)^T$ is an $n^2\times n^2$ matrix with unit Gaussian entries. Thus, by Lemma B.3, we have $\|\tfrac{1}{2}A^\pi + \tfrac{1}{2}(A^\pi)^T\| \le O(n) = O(\lambda^2)$. For (B.8) only syntactic changes are needed from the proof of (B.2). For (B.9), we observe that $\langle \mathrm{Id}^{\mathrm{sym}}, A^\pi\rangle$ is a sum of $O(n^2)$ independent Gaussians, so is $O(n\log n) \le O(\lambda^2\sqrt{n})$ with probability $1 - O(n^{-100})$. We turn finally to (B.10) and (B.11). Unlike in the degree 3 case, there is nothing special here about the diagonal, so we will be able to bound these cases together. Fix $i, j \le n$. We expand $\mathcal{L}\|x\|^2 x_i x_j$ as $\frac{1}{n^2\lambda^2}\sum_{\pi\in S_4} w^T A^\pi(e_i\otimes e_j)$. The vector $A^\pi(e_i\otimes e_j)$ is a vector of unit Gaussians, so $w^T A^\pi(e_i\otimes e_j) = O(\sqrt{n}\log n)$ with probability $1 - n^{-\omega(1)}$. Thus, also with probability $1 - n^{-\omega(1)}$, we get $n^2\max_{i,j}|\mathcal{L}\|x\|^2 x_i x_j| = O(1)$, which proves both (B.10) and (B.11). □